CSCI-S278: Applied Quantitative Finance and Machine Learning¶
Final Project¶
Harvard University
Summer 2024
Instructor: Dr. Mark Antonio Awada
Submitted by: Bethun Bhowmik
Abstract¶
- 30 stocks were selected based on their historical performance and market cap in the Technology, Finance, and Energy sectors for analysis
- From the initial basket, 15 stocks with mean returns greater than 1.1 were retained per my investment criteria and analyzed further
- Modelling was conducted on the filtered basket using Multiple Linear Regression (MLR) and Support Vector Regression (SVR)
- 9 pairs of stocks were identified as cointegrated, suggesting potential trading opportunities where the price movements of those stocks may be predictably related over the long term
- The maximum Sharpe ratio portfolio had a Sharpe ratio of 0.76, an annualized return of 29%, and an annualized volatility of 22%, with a significant allocation to high-growth stocks and some bond ETFs to balance risk
- To further reduce volatility, the Efficient Frontier was analyzed. While the annualized return dropped from 29% to 11% and the Sharpe ratio fell from 0.76 to 0.64, the annualized volatility fell drastically from 22% to 7%
- The constructed portfolio has a VaR of 1.66% and a CVaR of 2.18%, indicating well-contained tail risk
- The average pairwise correlation of the final constructed portfolio is 0.0118, indicating very low correlation and strong diversification
- The hedged portfolio PnL is promising and exceeds the investment criteria defined below
- The beta of the technology stocks basket is 1.3582, the beta of the finance stocks basket is 1.1040, and the overall portfolio beta is 1.2492
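The basket and portfolio betas reported above can be computed as the OLS slope of asset returns on market returns, i.e. Cov(asset, market) / Var(market). A minimal sketch of that calculation (the choice of market index, e.g. SPY, and the function name are my assumptions, not shown in the notebook):

```python
import numpy as np

def estimate_beta(asset_returns, market_returns):
    # OLS beta: Cov(asset, market) / Var(market), using sample (ddof=1) moments
    a = np.asarray(asset_returns, dtype=float)
    m = np.asarray(market_returns, dtype=float)
    return float(np.cov(a, m, ddof=1)[0, 1] / np.var(m, ddof=1))
```

Applied to each basket's daily returns against an index series, this reproduces basket-level betas like those quoted above.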
Topics Explored¶
- Data Preprocessing
- Summary statistics and graphical analysis
- Stock Returns
- Stationarity Testing - ADF Test
- Feature Engineering
- Lasso Regression
- Principal Component Analysis
- Sharpe Ratio
- Modeling
- Simple Linear Regression (SLR)
- Multiple Linear Regression (MLR)
- Support Vector Regression (SVR)
- Cointegration
- Portfolio Asset Allocation
- Efficient Frontier
- Diversification Factor
- Value at Risk; Conditional Value at Risk
- Volatility, Hedged Portfolio PnL
Introduction¶
I shortlisted 30 stocks based on their historical performance and market cap in the Technology, Finance, and Energy sectors, downloading their data from 2019-01-01 onward. Stocks with mean returns greater than 1.1 are then retained, as per my investment criteria, and analyzed further for portfolio creation.
Tech Stocks:
- AAPL - Apple Inc.
- MSFT - Microsoft Corporation
- NVDA - NVIDIA Corporation
- AVGO - Broadcom Inc.
- ORCL - Oracle Corporation
- CRM - Salesforce, Inc.
- ADBE - Adobe Inc.
- AMD - Advanced Micro Devices, Inc.
- ACN - Accenture plc
- CSCO - Cisco Systems, Inc.
Finance Stocks:
- BRK-B - Berkshire Hathaway Inc. (Class B)
- JPM - JPMorgan Chase & Co.
- V - Visa Inc.
- MA - Mastercard Incorporated
- BAC - Bank of America Corporation
- SBBG - Silvergate Capital Corporation
- WFC - Wells Fargo & Company
- GS - The Goldman Sachs Group, Inc.
- MS - Morgan Stanley
- AXP - American Express Company
Energy Stocks:
- XOM - Exxon Mobil Corporation
- CVX - Chevron Corporation
- COP - ConocoPhillips
- EOG - EOG Resources, Inc.
- SLB - SLB (formerly Schlumberger Limited)
- EPD - Enterprise Products Partners L.P.
- MPC - Marathon Petroleum Corporation
- PSX - Phillips 66
- OXY - Occidental Petroleum Corporation
- ET - Energy Transfer LP
Import necessary libraries and modules¶
We import the necessary libraries and modules required for the project.
import yfinance as yf
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, Normalizer
from statsmodels.tsa.stattools import adfuller
from pandas.plotting import lag_plot
from statsmodels.tsa.seasonal import seasonal_decompose
from random import seed
from random import random
from scipy.stats import norm
from statsmodels.graphics.tsaplots import plot_acf
import pandas.plotting as pd_plotting
from IPython.display import display
from scipy.cluster import hierarchy
from scipy.spatial import distance
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from statsmodels import regression, stats
import statsmodels.api as sm
from scipy.cluster.hierarchy import dendrogram, linkage
import pylab
from sklearn.impute import SimpleImputer
from sklearn.svm import SVR
from sklearn.svm import SVC
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.pipeline import make_pipeline
from statsmodels.tsa.stattools import coint
from sklearn.model_selection import TimeSeriesSplit , cross_val_score
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.manifold import TSNE
from matplotlib import cm
import scipy.optimize as sco
from sklearn.ensemble import RandomForestClassifier
from sklearn.covariance import LedoitWolf
from statsmodels.stats.stattools import jarque_bera
import warnings
warnings.filterwarnings("ignore")
Data Preprocessing¶
The data preprocessing workflow involved retrieving and combining stock tickers from the technology, finance, and energy sectors, followed by downloading their key statistics using the yfinance library. The data was then cleaned by converting all feature columns to numeric values, with non-numeric entries coerced to NaN, and missing values were imputed using the mean of each feature. The feature data was subsequently normalized to ensure consistency across all variables, with each feature scaled to have a mean of 0 and a standard deviation of 1. Finally, the original key statistics were saved to a CSV file, and a new DataFrame with the normalized data was created and inspected, ensuring the dataset was prepared for analysis.
tech_stocks = ['AAPL', 'MSFT', 'NVDA', 'AVGO', 'ORCL', 'CRM', 'ADBE', 'AMD', 'ACN', 'CSCO']
finance_stocks = ['BRK-B', 'JPM', 'V', 'MA', 'BAC', 'SBBG', 'WFC', 'GS', 'MS', 'AXP']
energy_stocks = ['XOM', 'CVX', 'COP', 'EOG', 'SLB', 'EPD', 'MPC', 'PSX', 'OXY', 'ET']
# Combine all tickers into a single list
all_tickers = tech_stocks + finance_stocks + energy_stocks
stock_data = yf.download(all_tickers, start="2019-01-01")
print(stock_data.head())
[*********************100%%**********************] 30 of 30 completed
Price        Adj Close
Ticker            AAPL         ACN        ADBE        AMD       AVGO  ...
Date
2019-01-02   37.793777  129.528809  224.570007  18.830000  21.319847  ...
2019-01-03   34.029247  125.106438  215.699997  17.049999  19.423426  ...
2019-01-04   35.481922  129.971008  226.190002  19.000000  19.614336  ...
2019-01-07   35.402954  130.422394  229.259995  20.570000  20.013796  ...
2019-01-08   36.077839  133.720764  232.679993  20.750000  19.853174  ...

[5 rows x 180 columns]
key_stats_list = []
# Retrieve key statistics for each ticker
for ticker in all_tickers:
    try:
        stock = yf.Ticker(ticker)
        key_stats = stock.info
        key_stats['Ticker'] = ticker
        key_stats_list.append(key_stats)
    except Exception as e:
        print(f"Could not retrieve data for {ticker}: {e}")
key_stats_df = pd.DataFrame(key_stats_list)
# Perform forward fill to handle missing data
key_stats_df.ffill(inplace=True)
# Save the DataFrame to a CSV file
key_stats_df.to_csv("key_statistics.csv", index=False)
print(key_stats_df.head())
address1 city state zip country \
0 One Apple Park Way Cupertino CA 95014 United States
1 One Microsoft Way Redmond WA 98052-6399 United States
2 2788 San Tomas Expressway Santa Clara CA 95051 United States
3 3421 Hillview Ave Palo Alto CA 94304 United States
4 2300 Oracle Way Austin TX 78741 United States
phone website industry \
0 408 996 1010 https://www.apple.com Consumer Electronics
1 425 882 8080 https://www.microsoft.com Software - Infrastructure
2 408 486 2000 https://www.nvidia.com Semiconductors
3 650 427 6000 https://www.broadcom.com Semiconductors
4 (737) 867-1000 https://www.oracle.com Software - Infrastructure
industryKey industryDisp ... revenueGrowth \
0 consumer-electronics Consumer Electronics ... 0.049
1 software-infrastructure Software - Infrastructure ... 0.152
2 semiconductors Semiconductors ... 2.621
3 semiconductors Semiconductors ... 0.164
4 software-infrastructure Software - Infrastructure ... 0.033
grossMargins ebitdaMargins operatingMargins financialCurrency \
0 0.45962 0.34175 0.29556 USD
1 0.69764 0.52804 0.43143 USD
2 0.75286 0.61768 0.64925 USD
3 0.74244 0.49957 0.31765 USD
4 0.71407 0.40080 0.33261 USD
trailingPegRatio Ticker address2 fax industrySymbol
0 2.1001 AAPL NaN NaN NaN
1 2.1709 MSFT NaN NaN NaN
2 1.1924 NVDA NaN NaN NaN
3 1.0732 AVGO NaN NaN NaN
4 1.9084 ORCL NaN NaN NaN
[5 rows x 136 columns]
# Select the feature columns from csv file
features = key_stats_df.columns[34:132]
data_features = key_stats_df[features]
X = data_features.values
print(data_features.head())
   dividendRate  dividendYield  exDividendDate  payoutRatio  ...  financialCurrency  trailingPegRatio
0          1.00         0.0048    1.723421e+09       0.1476  ...                USD            2.0592
1          3.00         0.0076    1.723680e+09       0.2483  ...                USD            2.1290
2          0.04         0.0004    1.718064e+09       0.0094  ...                USD            1.1428
3          2.10         0.0148    1.719187e+09       0.8488  ...                USD            1.0463
4          1.60         0.0125    1.720656e+09       0.4313  ...                USD            1.8921

[5 rows x 98 columns]
# Convert all feature columns to numeric, forcing errors to NaN
data_features = data_features.apply(pd.to_numeric, errors='coerce')
# Impute missing values using SimpleImputer
imputer = SimpleImputer(strategy='mean')
data_features_imputed = imputer.fit_transform(data_features)
# Normalize the raw feature data
scaler = StandardScaler()
data_features_N = scaler.fit_transform(data_features_imputed)
# Extract feature tickers
features_tickers = data_features.columns.values
# Save the DataFrame to a CSV file
key_stats_df.to_csv("key_statistics.csv", index=False)
# Display the head of the normalized DataFrame
data_features_N_df = pd.DataFrame(data_features_N, columns=features_tickers)
print(data_features_N_df.head())
   dividendRate  dividendYield  exDividendDate  payoutRatio  ...  ebitdaMargins  operatingMargins  trailingPegRatio
0     -0.783784      -1.064245        0.309385    -0.870566  ...       0.251472          0.156603         -0.283586
1      0.107893      -0.915226        0.310678    -0.460452  ...       1.126601          0.861625         -0.277820
2     -1.211789      -1.298418        0.282672    -1.433404  ...       1.547699          1.991883         -0.359285
3     -0.293362      -0.532034        0.288273     1.985165  ...       0.992858          0.271226         -0.367256
4     -0.516281      -0.654443        0.295597     0.284840  ...       0.528870          0.348853         -0.297389

[5 rows x 84 columns]
Summary statistics and graphical analysis¶
Analyzing the summary statistics of the stocks using describe() and plotting their performance from 2019-01-01 through 2024-08, it is observed that:
- The mean values for MSFT and NVDA are extremely high compared to other stocks. The standard deviations for these two stocks are also very large, indicating high volatility.
- The minimum and 25th percentile values for SBBG are 0, which is highly unusual. The closing price plot is also unusual and could indicate periods where the stock was inactive, delisted, or experienced extreme volatility.
- Boxplots for NVDA, AMD, AVGO, and PSX show several outliers, which could be due to specific stock-related events or indicate that these stocks are more volatile.
stock_data.describe()
| Price | Adj Close | ... | Volume | ||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ticker | AAPL | ACN | ADBE | AMD | AVGO | AXP | BAC | BRK-B | COP | CRM | ... | MSFT | NVDA | ORCL | OXY | PSX | SBBG | SLB | V | WFC | XOM |
| count | 1408.000000 | 1408.000000 | 1408.000000 | 1408.000000 | 1408.000000 | 1408.000000 | 1408.000000 | 1408.000000 | 1408.000000 | 1408.000000 | ... | 1.408000e+03 | 1.408000e+03 | 1.408000e+03 | 1.408000e+03 | 1.408000e+03 | 1407.000000 | 1.408000e+03 | 1.408000e+03 | 1.408000e+03 | 1.408000e+03 |
| mean | 128.650285 | 258.436890 | 435.037429 | 87.411349 | 55.126213 | 144.190631 | 31.021857 | 279.887670 | 73.694817 | 204.438077 | ... | 2.854695e+07 | 4.619667e+08 | 1.073515e+07 | 1.752329e+07 | 3.241489e+06 | 0.646766 | 1.225496e+07 | 7.802563e+06 | 2.566035e+07 | 2.103017e+07 |
| std | 49.456658 | 63.908956 | 115.324439 | 41.625963 | 34.470324 | 39.173737 | 6.391193 | 67.969994 | 31.728508 | 46.068845 | ... | 1.218376e+07 | 1.956134e+08 | 6.648476e+06 | 1.414406e+07 | 1.425770e+06 | 11.702495 | 5.757358e+06 | 3.639250e+06 | 1.402453e+07 | 1.039236e+07 |
| min | 34.029243 | 125.106400 | 215.699997 | 17.049999 | 14.659641 | 64.769859 | 16.262651 | 162.130005 | 19.237530 | 123.944916 | ... | 8.989200e+06 | 9.788400e+07 | 2.168200e+06 | 2.446300e+06 | 1.081600e+06 | 0.000000 | 3.077600e+06 | 1.640900e+06 | 4.635500e+06 | 3.979400e+06 |
| 25% | 78.796825 | 196.848690 | 335.487503 | 54.027501 | 27.744335 | 111.542433 | 25.929496 | 213.924999 | 48.168697 | 161.033649 | ... | 2.089445e+07 | 3.258312e+08 | 6.747825e+06 | 8.528325e+06 | 2.366625e+06 | 0.000000 | 8.539125e+06 | 5.382500e+06 | 1.623218e+07 | 1.399320e+07 |
| 50% | 140.728851 | 270.918060 | 444.309998 | 85.430000 | 45.524628 | 148.024498 | 30.492549 | 280.789993 | 64.220913 | 202.175781 | ... | 2.568230e+07 | 4.331810e+08 | 8.894400e+06 | 1.353000e+07 | 2.941850e+06 | 0.000000 | 1.100995e+07 | 6.943200e+06 | 2.112420e+07 | 1.838580e+07 |
| 75% | 169.090652 | 307.366241 | 515.902527 | 110.285000 | 60.902423 | 165.524677 | 35.791312 | 323.797508 | 106.432148 | 241.392776 | ... | 3.269340e+07 | 5.596882e+08 | 1.242005e+07 | 2.167385e+07 | 3.680250e+06 | 0.000000 | 1.439680e+07 | 9.171350e+06 | 3.046582e+07 | 2.547872e+07 |
| max | 234.820007 | 398.653076 | 688.369995 | 211.380005 | 182.308105 | 253.039993 | 46.190937 | 445.609985 | 132.675964 | 315.974762 | ... | 9.701270e+07 | 2.511528e+09 | 6.860570e+07 | 1.346690e+08 | 1.657980e+07 | 300.000000 | 8.614800e+07 | 3.837960e+07 | 1.189526e+08 | 8.443940e+07 |
8 rows × 180 columns
fig, axs = plt.subplots(3, 1, figsize=(14, 15))
# Plot tech stocks
for ticker in tech_stocks:
    axs[0].plot(stock_data['Close'][ticker], label=ticker)
axs[0].set_title('Tech Stocks Closing Prices')
axs[0].set_ylabel('Price')
axs[0].legend()
# Plot finance stocks
for ticker in finance_stocks:
    axs[1].plot(stock_data['Close'][ticker], label=ticker)
axs[1].set_title('Finance Stocks Closing Prices')
axs[1].set_ylabel('Price')
axs[1].legend()
# Plot energy stocks
for ticker in energy_stocks:
    axs[2].plot(stock_data['Close'][ticker], label=ticker)
axs[2].set_title('Energy Stocks Closing Prices')
axs[2].set_ylabel('Price')
axs[2].legend()
plt.tight_layout()
plt.show()
fig, axs = plt.subplots(3, 1, figsize=(14, 15))
# Plot log-transformed tech stocks
for ticker in tech_stocks:
    axs[0].plot(np.log(stock_data['Close'][ticker]), label=ticker)
axs[0].set_title('Tech Stocks Closing Prices (Log Transformed)')
axs[0].set_ylabel('Log Price')
axs[0].legend()
# Plot log-transformed finance stocks
for ticker in finance_stocks:
    axs[1].plot(np.log(stock_data['Close'][ticker]), label=ticker)
axs[1].set_title('Finance Stocks Closing Prices (Log Transformed)')
axs[1].set_ylabel('Log Price')
axs[1].legend()
# Plot log-transformed energy stocks
for ticker in energy_stocks:
    axs[2].plot(np.log(stock_data['Close'][ticker]), label=ticker)
axs[2].set_title('Energy Stocks Closing Prices (Log Transformed)')
axs[2].set_ylabel('Log Price')
axs[2].legend()
plt.tight_layout()
plt.show()
# Extract close data for different sets of stocks
close_data_tech = stock_data['Close'][tech_stocks]
close_data_finance = stock_data['Close'][finance_stocks]
close_data_energy = stock_data['Close'][energy_stocks]
plt.figure(figsize=(18, 12))
# Boxplot for tech stocks
plt.subplot(3, 1, 1)
close_data_tech.boxplot()
plt.title('Boxplot of Closing Prices of Technology Stocks')
plt.xlabel('Stocks')
plt.ylabel('Closing Price')
plt.xticks(rotation=90)
plt.grid(True)
# Boxplot for finance stocks
plt.subplot(3, 1, 2)
close_data_finance.boxplot()
plt.title('Boxplot of Closing Prices of Finance Stocks')
plt.xlabel('Stocks')
plt.ylabel('Closing Price')
plt.xticks(rotation=90)
plt.grid(True)
# Boxplot for energy stocks
plt.subplot(3, 1, 3)
close_data_energy.boxplot()
plt.title('Boxplot of Closing Prices of Energy Stocks')
plt.xlabel('Stocks')
plt.ylabel('Closing Price')
plt.xticks(rotation=90)
plt.grid(True)
plt.tight_layout()
plt.show()
Stock Returns¶
The mean returns for the stocks, computed over a 15-day window, are shown in descending order along with their standard deviations for each of the three sectors.
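The helper get_returns and the accumulator dictionaries used in the next cell are defined in an earlier cell that is not shown here. A minimal sketch consistent with how they are used below (the exact definition is a hypothetical reconstruction, inferred from the scale of the reported statistics):

```python
import pandas as pd

def get_returns(prices: pd.Series, days_out: int) -> list:
    # Hypothetical reconstruction: price change over a `days_out`-day horizon,
    # consistent with the per-stock statistics reported below
    return prices.diff(days_out).dropna().tolist()

# Accumulators filled in by calculate_returns
returns_dict, returns_means, returns_sds = {}, {}, {}
```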
def calculate_returns(stock_list, data):
    for stock in stock_list:
        stock_prices = data[stock].ffill()
        returns_dict[stock] = get_returns(stock_prices, 15)
        returns_means[stock] = np.mean(returns_dict[stock])
        returns_sds[stock] = np.std(returns_dict[stock])
# Calculate returns for tech, finance, and energy stocks
calculate_returns(tech_stocks, stock_data['Close'])
calculate_returns(finance_stocks, stock_data['Close'])
calculate_returns(energy_stocks, stock_data['Close'])
returns_stats_tech_df = pd.DataFrame({
'Stock': tech_stocks,
'Mean Return': [returns_means[stock] for stock in tech_stocks],
'Standard Deviation': [returns_sds[stock] for stock in tech_stocks]
})
returns_stats_finance_df = pd.DataFrame({
'Stock': finance_stocks,
'Mean Return': [returns_means[stock] for stock in finance_stocks],
'Standard Deviation': [returns_sds[stock] for stock in finance_stocks]
})
returns_stats_energy_df = pd.DataFrame({
'Stock': energy_stocks,
'Mean Return': [returns_means[stock] for stock in energy_stocks],
'Standard Deviation': [returns_sds[stock] for stock in energy_stocks]
})
print("Tech Stocks Mean Returns and Standard Deviations")
print(returns_stats_tech_df.sort_values(by='Mean Return', ascending=False))
print("\nFinance Stocks Mean Returns and Standard Deviations")
print(returns_stats_finance_df.sort_values(by='Mean Return', ascending=False))
print("\nEnergy Stocks Mean Returns and Standard Deviations")
print(returns_stats_energy_df.sort_values(by='Mean Return', ascending=False))
Tech Stocks Mean Returns and Standard Deviations
    Stock  Mean Return  Standard Deviation
1    MSFT     3.392337           15.344852
6    ADBE     3.215391           37.699295
8     ACN     1.936867           17.861691
0    AAPL     1.930573            9.157680
3    AVGO     1.343918            5.853680
7     AMD     1.293742           11.887166
2    NVDA     1.151718            4.938479
5     CRM     1.138695           17.615138
4    ORCL     0.943376            5.736916
9    CSCO     0.034065            2.822704

Finance Stocks Mean Returns and Standard Deviations
    Stock  Mean Return  Standard Deviation
7      GS     3.267713           21.616423
3      MA     2.712939           19.810422
0   BRK-B     2.517914           12.804207
9     AXP     1.534251           10.958976
2       V     1.343326           10.741866
1     JPM     1.138452            8.864121
8      MS     0.620989            5.570682
4     BAC     0.143821            2.635675
6     WFC     0.094530            3.504981
5    SBBG   -67.741398          319.382442

Energy Stocks Mean Returns and Standard Deviations
    Stock  Mean Return  Standard Deviation
6     MPC     1.143548            8.425291
7     PSX     0.502072            8.590547
0     XOM     0.481183            5.441699
2     COP     0.472186            7.298642
1     CVX     0.440581            9.156263
3     EOG     0.308824            9.044866
4     SLB     0.061147            4.048081
5     EPD     0.024681            1.609299
9      ET     0.017864            0.968930
8     OXY    -0.058595            4.979771
Stocks with mean returns > 1.1¶
The stocks with mean returns greater than 1.1 are retained, as per my investment criteria.
returns_stats_df = pd.DataFrame({
'Stock': list(returns_means.keys()),
'Mean Return': list(returns_means.values()),
'Standard Deviation': list(returns_sds.values())
})
# Filter out stocks with mean return less than 1.10
filtered_stocks_df = returns_stats_df[returns_stats_df['Mean Return'] >= 1.10]
# Sort the filtered DataFrame by Mean Return in descending order
filtered_stocks_df = filtered_stocks_df.sort_values(by='Mean Return', ascending=False)
print(filtered_stocks_df)
    Stock  Mean Return  Standard Deviation
1    MSFT     3.392337           15.344852
17     GS     3.267713           21.616423
6    ADBE     3.215391           37.699295
13     MA     2.712939           19.810422
10  BRK-B     2.517914           12.804207
8     ACN     1.936867           17.861691
0    AAPL     1.930573            9.157680
19    AXP     1.534251           10.958976
3    AVGO     1.343918            5.853680
12      V     1.343326           10.741866
7     AMD     1.293742           11.887166
2    NVDA     1.151718            4.938479
26    MPC     1.143548            8.425291
5     CRM     1.138695           17.615138
11    JPM     1.138452            8.864121
The histograms of the filtered list of top-performing stocks are visualized after normalizing the returns using Z-score normalization. We observe bell-shaped, symmetric histograms centered around zero for all 15 stocks, indicating that their returns are approximately normally distributed. Stocks with a wider spread suggest higher volatility.
filtered_stocks = filtered_stocks_df['Stock'].tolist()
# Filter the original stock data to include only the filtered stocks
filtered_close_data = stock_data['Close'][filtered_stocks]
# Calculate returns for filtered stocks
filtered_returns = {stock: get_returns(filtered_close_data[stock].ffill(), 15) for stock in filtered_stocks}
# Apply Z-score normalization to the returns
scaler = StandardScaler()
normalized_returns = {stock: scaler.fit_transform(np.array(filtered_returns[stock]).reshape(-1, 1)).reshape(-1) for stock in filtered_stocks}
# Plot normalized returns in a 5x4 grid
fig, axes = plt.subplots(5, 4, figsize=(20, 20))
axes = axes.flatten()
for i, stock in enumerate(filtered_stocks):
    axes[i].hist(normalized_returns[stock], bins=50, alpha=0.75, color='blue')
    axes[i].set_title(f'Normalized Returns for {stock}')
    axes[i].set_xlabel('Normalized Return')
    axes[i].set_ylabel('Frequency')
# Hide any unused subplots
for ax in axes[len(filtered_stocks):]:
    ax.set_visible(False)
plt.tight_layout()
plt.show()
Correlation analysis shows that all stocks are positively correlated, with values ranging from 0.28 to 0.99. Most stocks have a moderate to high correlation, except ADBE and MPC, which have a correlation of 0.28.
# Plot the correlation matrix for the filtered stocks
correlation_matrix = filtered_close_data.corr()
plt.figure(figsize=(16, 12))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title('Correlation Matrix of Stocks with Mean Return > 1.1')
plt.show()
Autocorrelation scatter plots for all filtered stocks indicate strong autocorrelation, suggesting that each stock's price on one day is strongly related to its prices on the previous days. All filtered stocks appear to be trending, and a momentum strategy could be effective.
# Autocorrelation scatter plot analysis for the filtered stocks
fig, axes = plt.subplots(5, 4, figsize=(20, 20))
axes = axes.flatten()
for i, stock in enumerate(filtered_stocks):
    lag_plot(filtered_close_data[stock].dropna(), lag=1, ax=axes[i])
    axes[i].set_title(f'Autocorrelation Scatter Plot for {stock}')
# Hide any unused subplots
for ax in axes[len(filtered_stocks):]:
    ax.set_visible(False)
plt.tight_layout()
plt.show()
All stocks show a good probability of increasing, with a meaningful likelihood of achieving a return greater than 20 over a 50-day period. Only AVGO and NVDA do not meet my investment criterion over the 50-day period; however, I keep both AVGO and NVDA in the basket for further analysis.
# Initialize lists to store mean and standard deviation of returns over different investment periods
all_returns_means = {stock: [] for stock in filtered_stocks}
all_returns_sds = {stock: [] for stock in filtered_stocks}
# Calculate mean and standard deviation of returns for 50 day period
for stock in filtered_stocks:
    for invest_time in range(1, 50):
        returns = get_returns(filtered_close_data[stock].ffill(), days_out=invest_time)
        all_returns_means[stock].append(np.mean(returns))
        all_returns_sds[stock].append(np.std(returns))
# Plot probabilities for exceeding a return of 20 for each stock over different investment periods
fig, axes = plt.subplots(5, 4, figsize=(20, 20))
axes = axes.flatten()
for i, stock in enumerate(filtered_stocks):
    prob_list = []
    for j in range(len(all_returns_means[stock])):
        prob = 1 - norm.cdf(20, all_returns_means[stock][j], all_returns_sds[stock][j])
        prob_list.append(prob)
    axes[i].plot(prob_list)
    axes[i].set_title(f'Prob of exceeding return of 20 for {stock}')
    axes[i].set_xlabel('Investment Period (Days)')
    axes[i].set_ylabel('Probability')
for ax in axes[len(filtered_stocks):]:
    ax.set_visible(False)
plt.tight_layout()
plt.show()
# Calculate probabilities for exceeding a return of 20
probabilities = {stock: [] for stock in filtered_stocks}
for stock in filtered_stocks:
    for j in range(len(all_returns_means[stock])):
        prob = 1 - norm.cdf(20, all_returns_means[stock][j], all_returns_sds[stock][j])
        probabilities[stock].append(prob)
# Calculate the maximum probability for each stock
max_probabilities = {stock: max(probabilities[stock]) for stock in filtered_stocks}
# Sort stocks by the highest probabilities
sorted_stocks = sorted(max_probabilities.items(), key=lambda x: x[1], reverse=True)
# Print the sorted stocks
print("Filtered Stocks Sorted by Highest Probability of Exceeding a Return of 20:")
for stock, prob in sorted_stocks:
    print(f"{stock}: {prob:.4f}")
Filtered Stocks Sorted by Highest Probability of Exceeding a Return of 20:
ADBE: 0.4387
GS: 0.3883
MSFT: 0.3690
MA: 0.3511
ACN: 0.3150
CRM: 0.2961
BRK-B: 0.2902
AMD: 0.2236
AXP: 0.2052
AAPL: 0.1961
V: 0.1690
JPM: 0.1473
MPC: 0.1450
AVGO: 0.0533
NVDA: 0.0451
Stationarity Testing - ADF Test¶
All the filtered stocks have strongly negative ADF statistics, with values ranging from about -5.91 to -7.25, which is strong evidence of stationarity. The returns are not random walks, and all stocks maintain consistent statistical properties over time. The consistency of these results across sectors suggests that the market dynamics influencing these stocks contribute to a stable return profile.
# Function to perform ADF test
def adf_test(timeseries):
    result = adfuller(timeseries)
    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
# Perform ADF test and plot the results in grids of 3
fig, axes = plt.subplots(len(filtered_stocks), 3, figsize=(15, 5 * len(filtered_stocks)))
axes = axes.flatten()
for i, stock in enumerate(filtered_stocks):
    rw_returns = filtered_returns[stock]
    print(f'ADF Test for {stock} returns:')
    adf_test(rw_returns)
    axes[i*3].plot(rw_returns, label=f'{stock} Returns')
    axes[i*3].set_title(f'{stock} Returns')
    axes[i*3].legend()
    # Random walk simulation
    seed(1)
    random_walk = []
    random_walk.append(-1 if random() < 0.5 else 1)
    for j in range(1, 1000):
        movement = -1 if random() < 0.5 else 1
        value = random_walk[j-1] + movement
        random_walk.append(value)
    axes[i*3+1].plot(random_walk, label='Random Walk')
    axes[i*3+1].set_title('Random Walk')
    axes[i*3+1].legend()
    pd_plotting.autocorrelation_plot(random_walk, ax=axes[i*3+2])
    axes[i*3+2].set_title('Autocorrelation of Random Walk')
for ax in axes[len(filtered_stocks)*3:]:
    ax.set_visible(False)
plt.tight_layout()
plt.show()
ADF Test for MSFT returns:
ADF Statistic: -7.000522
p-value: 0.000000
ADF Test for GS returns:
ADF Statistic: -7.252001
p-value: 0.000000
ADF Test for ADBE returns:
ADF Statistic: -6.407761
p-value: 0.000000
ADF Test for MA returns:
ADF Statistic: -7.021395
p-value: 0.000000
ADF Test for BRK-B returns:
ADF Statistic: -6.829493
p-value: 0.000000
ADF Test for ACN returns:
ADF Statistic: -6.898409
p-value: 0.000000
ADF Test for AAPL returns:
ADF Statistic: -6.735005
p-value: 0.000000
ADF Test for AXP returns:
ADF Statistic: -6.066305
p-value: 0.000000
ADF Test for AVGO returns:
ADF Statistic: -6.234501
p-value: 0.000000
ADF Test for V returns:
ADF Statistic: -6.849880
p-value: 0.000000
ADF Test for AMD returns:
ADF Statistic: -6.052814
p-value: 0.000000
ADF Test for NVDA returns:
ADF Statistic: -7.182806
p-value: 0.000000
ADF Test for MPC returns:
ADF Statistic: -6.274607
p-value: 0.000000
ADF Test for CRM returns:
ADF Statistic: -5.912784
p-value: 0.000000
ADF Test for JPM returns:
ADF Statistic: -6.096113
p-value: 0.000000
Feature Engineering¶
Most stocks have moderately scaled features, with values closer to zero, indicating that their financial and technical indicators are within a normal range, relative to their historical data. AMD and AVGO show high positive values for f1 (close-to-open return) and NVDA has the highest value for f10 (Z-score of closing prices), indicating that its closing price is significantly above its mean.
The hierarchical clustering dendrogram below provides a visual representation of the relationships between features based on their correlations. It's observed that f3 and f7, as well as f4 and f5, provide similar information, so we can reduce the feature set by selecting one feature from each group to avoid multicollinearity.
# Filter the original tech_stocks DataFrame to include only the filtered stocks
filtered_stock_data = stock_data.loc[:, (slice(None), filtered_stocks)]
# Feature Engineering
data_transform = filtered_stock_data.stack(level=1)
features = pd.DataFrame(index=data_transform.index).sort_index()
# Calculate features
features['f1'] = data_transform['Close'] / data_transform['Open'] - 1
features['f2'] = data_transform['Open'] / data_transform.groupby(level=1)['Close'].shift(1) - 1
features['f3'] = data_transform['Volume'].apply(np.log)
features['f4'] = data_transform.groupby(level=1)['Volume'].diff()
diff_50 = lambda x: x.diff(50)
features['f5'] = data_transform.groupby(level=1)['Volume'].apply(diff_50).sort_index(level='Date', axis=0).droplevel(0)
pct_chg_fxn = lambda x: x.pct_change()
features['f6'] = data_transform.groupby(level=1)['Volume'].apply(pct_chg_fxn).sort_index(level='Date', axis=0).droplevel(0)
ma_5 = lambda x: x.rolling(5).mean()
features['f7'] = data_transform.groupby(level=1)['Volume'].apply(ma_5).apply(np.log).sort_index(level='Date', axis=0).droplevel(0)
ma_200 = lambda x: x.rolling(200).mean() - 1
features['f8'] = data_transform['Volume'] / data_transform.groupby(level=1)['Volume'].apply(ma_200).sort_index(level='Date', axis=0).droplevel(0)
ema_50 = lambda x: x.ewm(span=50).mean() - 1
features['f9'] = data_transform['Close'] / data_transform.groupby(level=1)['Close'].apply(ema_50).sort_index(level='Date', axis=0).droplevel(0)
zscore_fxn = lambda x: (x - x.mean()) / x.std()
features['f10'] = data_transform.groupby(level=1)['Close'].apply(zscore_fxn).sort_index(level='Date', axis=0).droplevel(0)
features['f10'].unstack().plot.kde(title='Z-Scores')
plt.show()
print(features.tail())
f1 f2 f3 f4 f5 \
Date Ticker
2024-08-08 MA 0.007385 0.005335 14.276261 -474537.0 -1085537.0
MPC 0.024412 0.003838 14.431803 98648.0 -528952.0
MSFT 0.000721 0.009964 16.770303 -1453158.0 3479742.0
NVDA 0.029219 0.031139 19.773302 -24685314.0 -265972914.0
V 0.006586 0.006276 15.464718 -7705744.0 -1009244.0
f6 f7 f8 f9 f10
Date Ticker
2024-08-08 MA -0.230380 14.894516 0.623566 1.018084 1.817965
MPC 0.056261 14.637521 0.674100 1.017785 1.894379
MSFT -0.070368 17.110919 0.851567 0.940253 1.652688
NVDA -0.059997 19.921231 0.869428 0.934388 2.913131
V -0.596947 16.237610 0.766653 0.978246 1.431532
# Apply StandardScaler to the features
std_scaler = StandardScaler()
features_scaled = std_scaler.fit_transform(features.dropna())
df_scaled = pd.DataFrame(features_scaled, index=features.dropna().index, columns=features.dropna().columns)
# Print the last few rows of the scaled DataFrame
print(df_scaled.tail(15))
f1 f2 f3 f4 f5 f6 \
Date Ticker
2024-08-08 AAPL 0.027515 1.021149 0.929630 -0.468616 -0.110455 -0.503036
ACN 0.073314 -0.620449 -1.140396 -0.013236 -0.024025 -0.418452
ADBE 0.993924 0.882670 -1.027206 0.011624 -0.005302 0.291079
AMD 1.776348 1.719028 0.985938 -0.149258 -0.256779 -0.241245
AVGO 1.764844 2.393767 0.602534 -0.073377 0.056833 -0.226805
AXP 0.566081 0.628802 -1.286510 -0.030500 -0.011696 -0.741386
BRK-B 0.267326 0.313882 -0.909604 -0.054040 -0.009849 -0.733971
CRM 1.304728 0.866415 -0.698135 -0.012003 -0.081870 -0.260953
GS 0.883381 0.658743 -1.198521 -0.023453 -0.003147 -0.607601
JPM 0.404871 0.662465 -0.394793 -0.087061 -0.016665 -0.622663
MA 0.379093 0.323149 -1.212482 -0.011269 -0.014335 -0.407764
MPC 1.317857 0.221435 -1.112593 0.002897 -0.006299 -0.008641
MSFT 0.011623 0.637648 0.389193 -0.035456 0.051583 -0.184961
NVDA 1.582889 2.076351 2.317720 -0.609655 -3.839028 -0.170521
V 0.335001 0.387094 -0.449254 -0.189993 -0.013234 -0.918177
f7 f8 f9 f10
Date Ticker
2024-08-08 AAPL 1.308348 -0.550694 -0.281507 1.602402
ACN -0.964237 -0.683457 -0.323049 0.607716
ADBE -0.976996 -0.707346 -0.335857 0.674437
AMD 1.163429 -0.451041 -1.686687 1.049550
AVGO 0.692396 -0.373184 -0.843167 2.602059
AXP -0.678226 -1.074358 -0.530098 2.187578
BRK-B -0.516889 -0.540937 -0.091369 2.178171
CRM -0.458867 -0.908457 -0.565191 0.814590
GS -0.762472 -0.598449 0.001365 2.146716
JPM 0.040986 -0.769796 -0.304681 2.333219
MA -0.845464 -0.757714 -0.178871 1.743342
MPC -1.012686 -0.656671 -0.182449 1.825575
MSFT 0.596714 -0.301824 -1.110599 1.565478
NVDA 2.425338 -0.266110 -1.180803 2.921908
V 0.028465 -0.471610 -0.655780 1.327480
# Calculate the correlation matrix
corr_matrix = df_scaled.corr()
correlations_array = np.asarray(corr_matrix)
# Perform hierarchical clustering (named to avoid shadowing scipy's linkage function)
feature_linkage = hierarchy.linkage(distance.pdist(correlations_array), method='average')
# Plot the cluster map
g = sns.clustermap(corr_matrix, row_linkage=feature_linkage, col_linkage=feature_linkage,
row_cluster=True, col_cluster=True, figsize=(10, 10), cmap='Blues')
plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
plt.show()
label_order = corr_matrix.iloc[:, g.dendrogram_row.reordered_ind].columns
# Generate the hierarchical clustering dendrogram
Z = hierarchy.linkage(corr_matrix, 'average')  # fully qualified call, since the name 'linkage' is reassigned above
plt.figure(figsize=(25, 10))
labelsize = 20
ticksize = 15
plt.title('Hierarchical Clustering Dendrogram for Feature Selection', fontsize=labelsize)
plt.xlabel('Features', fontsize=labelsize)
plt.ylabel('Distance', fontsize=labelsize)
dendrogram(
Z,
leaf_rotation=90., # rotates the x axis labels
leaf_font_size=8., # font size for the x axis labels
labels=corr_matrix.columns
)
pylab.yticks(fontsize=ticksize)
pylab.xticks(rotation=-90, fontsize=ticksize)
plt.savefig('dendrogram_Feature_Selection.png')
plt.show()
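As a follow-up to the dendrogram, the cluster structure can be acted on programmatically: cut the tree at a chosen distance and keep one representative feature per cluster. A minimal sketch on toy data (the feature names, noise scales, and the 0.5 distance threshold are illustrative, not the project's actual values):

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy
from scipy.spatial import distance

# Toy features standing in for f1..f10: f3 and f7 are near-duplicates, f4 is independent
rng = np.random.default_rng(0)
base = rng.normal(size=500)
df = pd.DataFrame({
    'f3': base + rng.normal(scale=0.1, size=500),
    'f7': base + rng.normal(scale=0.1, size=500),
    'f4': rng.normal(size=500),
})
corr = df.corr()

# Cluster on correlation distance (1 - |corr|) and cut the tree at distance 0.5
condensed = distance.squareform((1 - corr.abs()).to_numpy(), checks=False)
Z = hierarchy.linkage(condensed, method='average')
labels = hierarchy.fcluster(Z, t=0.5, criterion='distance')

# Keep the first feature encountered in each cluster
keep = pd.Series(corr.columns).groupby(labels).first().tolist()
print(keep)  # f3 and f7 collapse to one representative; f4 survives
```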
Lasso Regression¶
- The Lasso regression for each stock identified f10 (the Z-score) as the dominant feature for predicting closing prices.
- All other features (f1 to f9) were deemed insignificant and thus received coefficients equal to zero.
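Since `Lasso()` was fit with its default penalty (alpha = 1.0), one way to check that this sparsity is not just an artifact of that choice is to let cross-validation pick the penalty. A minimal sketch on synthetic data mimicking the ten features (the coefficient of 50 on the last column is illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic stand-in for the ten scaled features: only the last column ("f10") matters
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 10))
y = 50.0 * X[:, 9] + rng.normal(scale=1.0, size=400)

# Let 5-fold cross-validation choose the L1 penalty instead of the default alpha=1.0
model = LassoCV(cv=5, random_state=42).fit(X, y)
print(model.alpha_)            # CV-selected penalty
print(model.coef_.round(2))    # only the last coefficient should be large
```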
rows = len(filtered_stocks) // 5 + (1 if len(filtered_stocks) % 5 != 0 else 0)
fig, axes = plt.subplots(rows, 5, figsize=(20, 5 * rows))
axes = axes.flatten()
for i, stock in enumerate(filtered_stocks):
stock_features = df_scaled.xs(stock, level='Ticker').iloc[0:-1]
close_prices = filtered_stock_data['Close'][stock][filtered_stock_data['Close'][stock].index >= '2020-10-16']
# Ensure the indices align for the Lasso regression
common_index = stock_features.index.intersection(close_prices.index)
stock_features = stock_features.loc[common_index]
close_prices = close_prices.loc[common_index]
# Fit the Lasso model
lasso = Lasso()
lasso.fit(stock_features, close_prices)
# Print the Lasso coefficients
print(f'Lasso Coefficients for {stock}:')
print(lasso.coef_)
# Scatter plot of the predictions vs actual close prices
axes[i].scatter(lasso.predict(stock_features.iloc[-100:]), close_prices.iloc[-100:])
axes[i].set_xlabel('Predicted Prices')
axes[i].set_ylabel('Actual Prices')
axes[i].set_title(f'Lasso Predictions vs Actual Prices for {stock}')
# Hide any unused subplots
for ax in axes[len(filtered_stocks):]:
ax.set_visible(False)
plt.tight_layout()
plt.show()
Lasso Coefficients for MSFT:
[ 0. 0. -0. -0. -0. 0. -0. -0. 0. 80.89595679]
Lasso Coefficients for GS:
[ 0. 0. -0. 0. 0. -0. -0. 0. 0. 75.04839359]
Lasso Coefficients for ADBE:
[ 0. 0. -0. -0. -0. -0. -0. -0. 0. 105.93142107]
Lasso Coefficients for MA:
[ 0. 0. -0. -0. -0. -0. -0. 0. 0. 55.66812595]
Lasso Coefficients for BRK-B:
[ 0. -0. -0. -0. 0. -0. -0. 0. 0. 61.88946648]
Lasso Coefficients for ACN:
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 55.31763005]
Lasso Coefficients for AAPL:
[ 0. 0. -0. -0. 0. 0. -0. 0. 0. 42.30230772]
Lasso Coefficients for AXP:
[ 0. -0. -0. -0. -0. -0. -0. 0. 0. 33.8091874]
Lasso Coefficients for AVGO:
[-0. 0. 0. -0. 0. 0. 0. 0. 0. 30.24816648]
Lasso Coefficients for V:
[ 0. 0. -0. -0. -0. -0. -0. 0. 0. 29.35582917]
Lasso Coefficients for AMD:
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 37.23681318]
Lasso Coefficients for NVDA:
[-0. 0. 0. -0. -0. -0. 0. -0. 0. 24.46192753]
Lasso Coefficients for MPC:
[-0. -0. -0. -0. -0. -0. -0. -0. -0. 41.97298818]
Lasso Coefficients for CRM:
[-0. 0. -0. -0. -0. 0. -0. -0. 0. 41.9195327]
Lasso Coefficients for JPM:
[ 0. -0. -0. 0. 0. -0. -0. 0. 0. 25.47116204]
Principal Component Analysis¶
To further explore the effect of the different features, PCA is performed on all the features downloaded from Yahoo Finance during data preprocessing and stored in the key_statistics dataset. It's observed that 15 principal components explain 95.00% of the variance. The eigen-portfolio weights were also plotted to visually represent the effect of the different features.
# Perform PCA
pca = PCA(random_state=84)
pca.fit(data_features_N)
PCA(random_state=84)
# Function to plot PCA explained variance
def plotPCA(plot=False):
var_threshold = 0.95
var_explained = np.cumsum(pca.explained_variance_ratio_)
num_comp = np.where(np.logical_not(var_explained < var_threshold))[0][0] + 1
if plot:
print('%d principal components explain %.2f%% of variance' % (num_comp, 100 * var_threshold))
# PCA percent variance explained.
bar_width = 0.9
n_asset = var_explained.shape[0]
x_indx = np.arange(n_asset)
fig, ax = plt.subplots()
# Eigenvalues measured as percentage of explained variance.
rects = ax.bar(x_indx, pca.explained_variance_ratio_[:n_asset], bar_width)
ax.set_xticks(x_indx + bar_width / 2)
ax.set_xticklabels(list(range(n_asset)), rotation=45)
ax.set_title('Percent variance explained')
ax.set_ylabel('Explained Variance')
ax.set_xlabel('Principal Components')
plt.show()
plotPCA(plot=True)
15 principal components explain 95.00% of variance
pcs = pca.components_
pcs.shape
(30, 84)
# Plot cumulative explained variance
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Variance (%)') # for each component
plt.title('Explained Variance of Features Dataset for Filtered Stocks with Mean Return > 10%')
plt.show()
def PCWeights():
weights = pd.DataFrame()
for i in range(len(pcs)):
weights["weights_{}".format(i)] = pcs[i, :] / sum(pcs[i, :])
weights = weights.values.T
return weights
weights = PCWeights()
portfolio = pd.DataFrame()
def plotEigen(weights, plot=False, portfolio=portfolio):
portfolio = pd.DataFrame({'weights': weights.squeeze()}, index=features_tickers)
portfolio.sort_values(by=['weights'], ascending=False, inplace=True)
if plot:
print('Sum of weights of current eigen-portfolio: %.2f' % np.sum(portfolio))
ax = portfolio.plot(title='Current Eigen-Portfolio Weights',
figsize=(20,8),
xticks=range(0, len(features_tickers),1),
rot=80,
linewidth=5
)
plt.tick_params(labelsize = 10)
plt.xlabel('Features', fontsize=10)
plt.grid(True)
plt.title('Eigen-Portfolio Features Weights ',fontdict={'fontsize':10})
legend = ax.legend(loc=0, ncol=1, bbox_to_anchor=(0, 0, 1,1),fancybox=True,shadow=False,title='Feature',prop={'size':10})
plt.setp(legend.get_title(),fontsize='10')
plt.show()
return portfolio
# Eigen-Portfolio Features Weights for PC0
plotEigen(weights=weights[0], plot=True)
Sum of weights of current eigen-portfolio: 1.00
| feature | weights |
|---|---|
| priceToSalesTrailing12Months | 0.041007 |
| enterpriseToRevenue | 0.038891 |
| forwardPE | 0.037596 |
| returnOnAssets | 0.037534 |
| marketCap | 0.036009 |
| ... | ... |
| bidSize | -0.026761 |
| askSize | -0.027178 |
| trailingAnnualDividendYield | -0.030380 |
| fiveYearAvgDividendYield | -0.030535 |
| dividendYield | -0.030988 |
84 rows × 1 columns
# Eigen-Portfolio Features Weights for PC1
plotEigen(weights=weights[1], plot=True)
Sum of weights of current eigen-portfolio: 1.00
| feature | weights |
|---|---|
| targetLowPrice | 0.189828 |
| fiftyTwoWeekLow | 0.188580 |
| bid | 0.186828 |
| twoHundredDayAverage | 0.186676 |
| currentPrice | 0.186587 |
| ... | ... |
| averageVolume | -0.120394 |
| averageDailyVolume10Day | -0.122629 |
| averageVolume10days | -0.122629 |
| sharesShortPriorMonth | -0.133092 |
| sharesShort | -0.133342 |
84 rows × 1 columns
# Create DataFrame of principal component weights
listpca = []
for k in range(len(pcs)):
temp = 'PC{}'.format(k)
listpca.append(temp)
pcs_df = pd.DataFrame(pcs, columns=data_features_N_df.columns, index=listpca).T
print(pcs_df)
PC0 PC1 PC2 PC3 PC4 \
dividendRate -0.048379 -0.130982 -0.020105 -0.021320 0.197945
dividendYield -0.149468 0.061498 -0.140336 0.038130 0.105972
exDividendDate -0.027124 0.014076 -0.245937 -0.026177 -0.018955
payoutRatio -0.117763 0.039783 -0.145802 0.010546 0.024336
fiveYearAvgDividendYield -0.147285 0.076689 -0.156519 -0.001907 0.068507
... ... ... ... ... ...
revenueGrowth 0.155595 0.093477 -0.060585 -0.071897 0.216512
grossMargins 0.094378 -0.041329 0.128437 -0.097652 -0.075300
ebitdaMargins 0.120918 0.006686 -0.052787 -0.117143 -0.151845
operatingMargins 0.140785 -0.063509 -0.108683 -0.108845 0.044952
trailingPegRatio -0.030288 0.004814 -0.015103 -0.032012 0.045075
PC5 PC6 PC7 PC8 PC9 \
dividendRate -0.157050 -0.149759 -0.140911 -0.236856 0.024036
dividendYield -0.002725 0.150093 0.018042 -0.097587 -0.119056
exDividendDate -0.224770 0.002458 -0.164308 0.104952 0.055360
payoutRatio -0.031680 0.202601 0.075190 -0.088681 -0.129055
fiveYearAvgDividendYield 0.035719 0.161989 -0.037889 -0.135477 -0.059661
... ... ... ... ... ...
revenueGrowth -0.047463 0.047818 -0.045491 -0.013412 0.017357
grossMargins -0.222813 -0.050302 0.099113 0.164329 0.003796
ebitdaMargins -0.033804 0.048992 -0.068854 0.246931 -0.101479
operatingMargins -0.163169 0.002744 0.066693 0.227730 -0.103223
trailingPegRatio -0.014909 -0.114944 -0.167734 0.280771 -0.100171
... PC20 PC21 PC22 PC23 \
dividendRate ... -0.082749 -0.097275 0.124855 -0.102519
dividendYield ... -0.112359 0.119084 0.050798 0.024250
exDividendDate ... 0.108800 0.064336 -0.040363 0.008877
payoutRatio ... 0.145035 -0.003125 -0.299832 -0.013908
fiveYearAvgDividendYield ... -0.044933 0.009756 0.135016 0.366561
... ... ... ... ... ...
revenueGrowth ... 0.031829 0.031749 -0.008011 0.156973
grossMargins ... -0.164450 0.022109 -0.217426 0.088494
ebitdaMargins ... 0.067797 -0.297834 0.133059 -0.070754
operatingMargins ... -0.072586 0.079647 -0.014682 0.110551
trailingPegRatio ... 0.249647 -0.112351 0.014765 0.034876
PC24 PC25 PC26 PC27 PC28 \
dividendRate 0.092744 -0.112563 -0.157720 -0.003112 0.109221
dividendYield -0.071480 -0.085306 0.067945 0.025914 0.057856
exDividendDate -0.021438 -0.133997 0.003257 -0.065362 -0.050976
payoutRatio -0.072947 -0.101622 -0.027105 -0.130891 0.017275
fiveYearAvgDividendYield -0.020263 -0.083643 0.160487 -0.226531 0.033801
... ... ... ... ... ...
revenueGrowth 0.026528 -0.082986 -0.157205 0.114624 0.216012
grossMargins 0.174646 -0.034326 0.190523 0.057691 -0.185854
ebitdaMargins 0.053702 -0.070199 0.042529 0.142212 -0.078967
operatingMargins 0.129262 0.004023 -0.137815 0.048256 -0.009850
trailingPegRatio -0.081021 -0.115694 -0.081133 -0.167371 0.037547
PC29
dividendRate -0.095440
dividendYield -0.429226
exDividendDate 0.114266
payoutRatio 0.055607
fiveYearAvgDividendYield 0.017510
... ...
revenueGrowth -0.043651
grossMargins -0.078146
ebitdaMargins 0.047463
operatingMargins 0.036036
trailingPegRatio 0.013398
[84 rows x 30 columns]
Sharpe Ratio¶
The code below finds the most optimized portfolio, in terms of risk-adjusted return (Sharpe ratio), from a set of 30 portfolios constructed using the principal components of the key_statistics dataset for the filtered stocks.
Eigen portfolio #1 was identified as having the highest Sharpe ratio, but both the return and the Sharpe ratio are displayed as NaN, indicating an issue in the key_statistics dataset.
NOTE: Moving forward the key_statistics dataset will not be used for modelling analysis due to data discrepancy and undefined values.
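As a sanity check of the `sharpe_ratio` helper defined in the next cell, here is a standalone sketch on synthetic daily returns (the function is restated so the snippet runs on its own; the 0.05% mean and 1% volatility figures are illustrative):

```python
import numpy as np
import pandas as pd

# Restated copy of the notebook's sharpe_ratio helper so this snippet is standalone
def sharpe_ratio(ts_returns, periods_per_year=252):
    n_years = ts_returns.shape[0] / periods_per_year
    annualized_return = np.power(np.prod(1 + ts_returns), (1 / n_years)) - 1
    annualized_vol = ts_returns.std() * np.sqrt(periods_per_year)
    return annualized_return, annualized_vol, annualized_return / annualized_vol

# One year of synthetic daily returns: ~0.05% mean, ~1% daily volatility
rng = np.random.default_rng(7)
rets = pd.Series(rng.normal(loc=0.0005, scale=0.01, size=252))
er, vol, sr = sharpe_ratio(rets)
print(f'Return {er:.2%}, Vol {vol:.2%}, Sharpe {sr:.2f}')
```

Annualized volatility should land near 0.01 × √252 ≈ 16%, which is a quick way to confirm the annualization is wired up correctly.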
# Sharpe Ratio
def sharpe_ratio(ts_returns, periods_per_year=252):
n_years = ts_returns.shape[0] / periods_per_year
annualized_return = np.power(np.prod(1 + ts_returns), (1 / n_years)) - 1
annualized_vol = ts_returns.std() * np.sqrt(periods_per_year)
annualized_sharpe = annualized_return / annualized_vol
return annualized_return, annualized_vol, annualized_sharpe
# Optimized Portfolio
def optimizedPortfolio():
n_portfolios = len(pcs)
annualized_ret = np.array([0.] * n_portfolios)
sharpe_metric = np.array([0.] * n_portfolios)
annualized_vol = np.array([0.] * n_portfolios)
highest_sharpe = 0
highest_sharpe_weights = None
for i in range(n_portfolios):
pc_w = pcs[i, :] / sum(pcs[i, :])
data_feature_N = {'weights': pc_w.squeeze()}
eigen_prtfi = pd.DataFrame(data_feature_N, index=features_tickers)
eigen_prtfi.sort_values(by=['weights'], ascending=False, inplace=True)
eigen_prti_returns = np.dot(data_features_N, eigen_prtfi['weights'])
eigen_prti_returns = pd.Series(eigen_prti_returns, index=data_features_N_df.index)
er, vol, sharpe = sharpe_ratio(eigen_prti_returns)
annualized_ret[i] = er
annualized_vol[i] = vol
sharpe_metric[i] = sharpe
if sharpe > highest_sharpe:
highest_sharpe = sharpe
highest_sharpe_weights = eigen_prtfi
highest_sharpe_index = np.argmax(sharpe_metric)
print(f'Eigen portfolio #{highest_sharpe_index} with the highest Sharpe. Return {annualized_ret[highest_sharpe_index]*100:.2f}%, vol = {annualized_vol[highest_sharpe_index]*100:.2f}%, Sharpe = {sharpe_metric[highest_sharpe_index]:.2f}')
print('Top features in the highest Sharpe portfolio:')
print(highest_sharpe_weights.head(10))
fig, ax = plt.subplots()
fig.set_size_inches(12, 4)
ax.plot(sharpe_metric, linewidth=3)
ax.set_title('Sharpe ratio of eigen-portfolios')
ax.set_ylabel('Sharpe ratio')
ax.set_xlabel('Portfolios')
results = pd.DataFrame(data={'Return': annualized_ret, 'Vol': annualized_vol, 'Sharpe': sharpe_metric})
results.dropna(inplace=True)
results.sort_values(by=['Sharpe'], ascending=False, inplace=True)
print(results.head(10))
plt.show()
optimizedPortfolio()
Eigen portfolio #1 with the highest Sharpe. Return nan%, vol = 1551.30%, Sharpe = nan
Top features in the highest Sharpe portfolio:
weights
fiveYearAvgDividendYield 2.379390
debtToEquity 2.000594
totalRevenue 1.714917
ebitda 1.612266
enterpriseToRevenue 1.084224
numberOfAnalystOpinions 1.053090
revenueGrowth 1.018931
recommendationMean 0.978078
priceToSalesTrailing12Months 0.974780
shortPercentOfFloat 0.723518
Return Vol Sharpe
23 6.830871e+182 141.240288 4.836348e+180
22 2.633811e+97 61.617070 4.274483e+95
29 1.488836e+95 63.171098 2.356831e+93
28 5.455202e+87 63.461357 8.596101e+85
19 5.586205e+65 54.850169 1.018448e+64
20 6.415324e+52 44.654393 1.436661e+51
18 7.357409e+49 48.284172 1.523773e+48
7 2.186805e+13 28.714243 7.615750e+11
27 -1.000000e+00 25.984649 -3.848426e-02
12 -1.000000e+00 23.306952 -4.290565e-02
Modeling¶
Simple Linear Regression¶
We choose GS (Goldman Sachs) and ADBE (Adobe) for simple linear regression. The SLR R-squared of 0.4333 indicates that 43.33% of the variance in GS closing prices can be explained by ADBE closing prices under the linear model.
# Tickers
stock_x_ticker = 'ADBE'
stock_y_ticker = 'GS'
# Get the closing prices
x_data = stock_data['Close'][stock_x_ticker].dropna()
y_data = stock_data['Close'][stock_y_ticker].dropna()
# Align the indices
common_index = x_data.index.intersection(y_data.index)
x_data = x_data.loc[common_index]
y_data = y_data.loc[common_index]
# Perform Simple Linear Regression
slr = sm.OLS(y_data, sm.add_constant(x_data)).fit()
slr_prediction = slr.params.iloc[0] + slr.params.iloc[1] * x_data  # positional access via iloc (integer keys on a labeled Series are deprecated)
print(f'SLR R-squared: {slr.rsquared}')
print(f'SLR Adjusted R-squared: {slr.rsquared_adj}')
plt.figure(figsize=(10, 6))
plt.scatter(x_data, y_data, label='Actual Data')
plt.plot(x_data, slr_prediction, color='red', label='Fitted Line')
plt.xlabel(stock_x_ticker)
plt.ylabel(stock_y_ticker)
plt.title(f'Simple Linear Regression between {stock_x_ticker} and {stock_y_ticker}')
plt.legend()
plt.show()
SLR R-squared: 0.4332853110364281
SLR Adjusted R-squared: 0.432882814808471
Multiple Linear Regression¶
- From the MLR output, an R-squared of 0.915 indicates that approximately 91.5% of the variance in Adobe's (ADBE) closing prices can be explained by the model using the selected basket of 14 stocks as predictor variables.
- The high F-statistic (1067) and its associated p-value of essentially 0 indicate that the overall regression model is statistically significant and the predictors are related to ADBE closing prices.
- The high condition number of 11,600 suggests multicollinearity among the predictor variables.
- The low Durbin-Watson statistic (0.059) suggests strong positive autocorrelation in the residuals, which may indicate time-dependent structure not captured by the model.
# Define the target variable (ADBE)
target_stock = 'ADBE'
y_data = closing_prices[target_stock].dropna()
# Define the predictor variables (all other stocks)
x_data = closing_prices.drop(columns=[target_stock]).dropna()
# Align the indices
common_index = y_data.index.intersection(x_data.index)
x_data = x_data.loc[common_index]
y_data = y_data.loc[common_index]
# Perform Multiple Linear Regression
X = sm.add_constant(np.column_stack((AAPL_close, MSFT_close, NVDA_close, AVGO_close, ORCL_close, CRM_close, AMD_close, ACN_close, CSCO_close, BRKB_close, JPM_close, V_close, MA_close, BAC_close)))
mlr = sm.OLS(y_data, X).fit()
print(f'MLR ADBE and Filtered Basket - R-squared: {mlr.rsquared}')
print(f'MLR ADBE and Filtered Basket - Adjusted R-squared: {mlr.rsquared_adj}')
print(mlr.summary())
# Predict using the MLR model
mlr_prediction = mlr.predict(X)
plt.figure(figsize=(12, 8))
plt.plot(y_data, label='ADBE Actual')
plt.plot(y_data.index, mlr_prediction, label='MLR Prediction', color='red')
plt.xlabel('Date')
plt.ylabel('ADBE Closing Price')
plt.title(f'Multiple Linear Regression for ADBE using Filtered Stock Basket')
plt.legend()
plt.show()
MLR ADBE and Filtered Basket - R-squared: 0.9147368952268922
MLR ADBE and Filtered Basket - Adjusted R-squared: 0.9138799796010317
OLS Regression Results
==============================================================================
Dep. Variable: ADBE R-squared: 0.915
Model: OLS Adj. R-squared: 0.914
Method: Least Squares F-statistic: 1067.
Date: Fri, 09 Aug 2024 Prob (F-statistic): 0.00
Time: 04:06:40 Log-Likelihood: -6948.9
No. Observations: 1408 AIC: 1.393e+04
Df Residuals: 1393 BIC: 1.401e+04
Df Model: 14
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -91.9516 14.773 -6.224 0.000 -120.931 -62.972
x1 0.7230 0.093 7.806 0.000 0.541 0.905
x2 0.5874 0.096 6.094 0.000 0.398 0.777
x3 0.2788 0.245 1.139 0.255 -0.201 0.759
x4 -0.1793 0.225 -0.798 0.425 -0.620 0.262
x5 0.4350 0.205 2.122 0.034 0.033 0.837
x6 1.2542 0.052 24.334 0.000 1.153 1.355
x7 -0.8751 0.127 -6.875 0.000 -1.125 -0.625
x8 0.8828 0.087 10.182 0.000 0.713 1.053
x9 2.3370 0.294 7.955 0.000 1.761 2.913
x10 -1.3736 0.068 -20.059 0.000 -1.508 -1.239
x11 0.4377 0.160 2.727 0.006 0.123 0.752
x12 0.8918 0.200 4.465 0.000 0.500 1.284
x13 -0.2623 0.108 -2.423 0.016 -0.475 -0.050
x14 -1.6687 0.480 -3.479 0.001 -2.610 -0.728
==============================================================================
Omnibus: 34.354 Durbin-Watson: 0.059
Prob(Omnibus): 0.000 Jarque-Bera (JB): 40.072
Skew: -0.322 Prob(JB): 1.99e-09
Kurtosis: 3.518 Cond. No. 1.16e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.16e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Support Vector Regression (SVR)¶
- The SVR model using AAPL's closing prices provides a moderate prediction of ADBE's prices based on the R-squared values: the model explains only 52.6% of the variance on the training set and 49.1% on the test set. The close MSE and MAE values between the training and test sets suggest that the model does not suffer from severe overfitting and generalizes relatively well to unseen data.
- Using the basket of selected stocks significantly improves the SVR model's ability to predict ADBE's closing prices compared to using a single stock like AAPL: the basket model explains 79.7% of the variance on the training set and 78.7% on the test set. The lower MSE and MAE values compared to the single-stock model suggest improved prediction accuracy when using the basket of stocks.
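Both SVR models below use the defaults `C=1.0, epsilon=0.2`; a cross-validated grid search over these hyperparameters might tighten the fit further. A minimal sketch on synthetic data (the grid values are illustrative):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression problem standing in for the closing-price data
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=300)

# Search over C and epsilon with 5-fold CV; scoring defaults to R-squared
pipe = make_pipeline(StandardScaler(), SVR(kernel='rbf'))
grid = GridSearchCV(pipe, {'svr__C': [0.1, 1.0, 10.0, 100.0],
                           'svr__epsilon': [0.05, 0.2]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```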
# Define the target variable (ADBE)
target_stock = 'ADBE'
y_data = closing_prices[target_stock].dropna()
# Define the predictor variables (all other stocks)
x_data = closing_prices.drop(columns=[target_stock]).dropna()
# Align the indices
common_index = y_data.index.intersection(x_data.index)
x_data = x_data.loc[common_index]
y_data = y_data.loc[common_index]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=42)
# SVR with one stock (e.g., AAPL)
svr_model_single = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=1.0, epsilon=0.2))
X_train_single = X_train[['AAPL']]
X_test_single = X_test[['AAPL']]
svr_model_single.fit(X_train_single, y_train)
y_train_pred_single = svr_model_single.predict(X_train_single)
y_test_pred_single = svr_model_single.predict(X_test_single)
# Evaluate the single stock model
train_mse_single = mean_squared_error(y_train, y_train_pred_single)
test_mse_single = mean_squared_error(y_test, y_test_pred_single)
train_mae_single = mean_absolute_error(y_train, y_train_pred_single)
test_mae_single = mean_absolute_error(y_test, y_test_pred_single)
train_r2_single = r2_score(y_train, y_train_pred_single)
test_r2_single = r2_score(y_test, y_test_pred_single)
print('SVR with Single Stock (AAPL):')
print(f'Train MSE: {train_mse_single}')
print(f'Test MSE: {test_mse_single}')
print(f'Train MAE: {train_mae_single}')
print(f'Test MAE: {test_mae_single}')
print(f'Train R-squared: {train_r2_single}')
print(f'Test R-squared: {test_r2_single}')
print()
SVR with Single Stock (AAPL):
Train MSE: 6351.090442650765
Test MSE: 6545.172635117789
Train MAE: 59.08351976513713
Test MAE: 59.8118066194894
Train R-squared: 0.5258430089690345
Test R-squared: 0.49137295508659706
# Plot actual vs predicted values for testing sets for single stock (AAPL)
plt.figure(figsize=(14, 7))
plt.plot(y_test.index, y_test, label='Actual Test ADBE')
plt.plot(y_test.index, y_test_pred_single, label='SVR Prediction (AAPL)', alpha=0.7)
plt.xlabel('Date')
plt.ylabel('ADBE Closing Price')
plt.title('SVR for ADBE using AAPL')
plt.legend()
plt.show()
# SVR with the entire basket of stocks
svr_model_basket = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=1.0, epsilon=0.2))
svr_model_basket.fit(X_train, y_train)
y_train_pred_basket = svr_model_basket.predict(X_train)
y_test_pred_basket = svr_model_basket.predict(X_test)
# Evaluate the basket model
train_mse_basket = mean_squared_error(y_train, y_train_pred_basket)
test_mse_basket = mean_squared_error(y_test, y_test_pred_basket)
train_mae_basket = mean_absolute_error(y_train, y_train_pred_basket)
test_mae_basket = mean_absolute_error(y_test, y_test_pred_basket)
train_r2_basket = r2_score(y_train, y_train_pred_basket)
test_r2_basket = r2_score(y_test, y_test_pred_basket)
print('SVR with Basket of Stocks:')
print(f'Train MSE: {train_mse_basket}')
print(f'Test MSE: {test_mse_basket}')
print(f'Train MAE: {train_mae_basket}')
print(f'Test MAE: {test_mae_basket}')
print(f'Train R-squared: {train_r2_basket}')
print(f'Test R-squared: {test_r2_basket}')
print()
SVR with Basket of Stocks:
Train MSE: 2716.2138820025752
Test MSE: 2740.6975790822753
Train MAE: 39.99276524814664
Test MAE: 40.35452036488289
Train R-squared: 0.797214066951731
Test R-squared: 0.7870196878886682
# Plot actual vs predicted values for testing sets for basket of stocks
plt.figure(figsize=(14, 7))
plt.plot(y_test.index, y_test, label='Actual Test ADBE')
plt.plot(y_test.index, y_test_pred_basket, label='SVR Prediction (Basket)', alpha=0.7)
plt.xlabel('Date')
plt.ylabel('ADBE Closing Price')
plt.title('SVR for ADBE using Basket of Stocks')
plt.legend()
plt.show()
g = sns.pairplot(closing_prices)
g.fig.suptitle('Pair Plot of Stock Prices', y=1.02)  # plt.title would only label the last subplot
plt.show()
Cointegration¶
9 pairs of stocks were identified as cointegrated, suggesting potential trading opportunities where the price movements of these stocks may be predictably related over the long term. These pairs can be useful for pairs-trading strategies.
- AMD and MA appear in the most pairs, indicating that they have significant cointegrated relationships with several other stocks.
- MLR and Dickey-Fuller tests were performed on the cointegrated pairs. The R-squared values for the MLR models range from 64.9% to 97.4%, suggesting a strong linear relationship within each pair.
- The ADF statistics for all pairs are negative and the p-values are all below 0.05, providing strong evidence of cointegration between the pairs.
# Cointegration analysis
def find_cointegrated_pairs(data):
n = data.shape[1]
score_matrix = np.zeros((n, n))
pvalue_matrix = np.ones((n, n))
keys = data.keys()
pairs = []
for i in range(n):
for j in range(i + 1, n):
S1 = data[keys[i]]
S2 = data[keys[j]]
result = coint(S1, S2)
score = result[0]
pvalue = result[1]
score_matrix[i, j] = score
pvalue_matrix[i, j] = pvalue
if pvalue < 0.05:
pairs.append((keys[i], keys[j]))
return score_matrix, pvalue_matrix, pairs
# Perform cointegration analysis
scores, pvalues, pairs = find_cointegrated_pairs(closing_prices)
# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(pvalues, xticklabels=closing_prices.columns, yticklabels=closing_prices.columns, cmap='RdYlGn_r', mask=(pvalues >= 0.99))
plt.title('Cointegration P-Values Heatmap')
plt.show()
# Print the cointegrated pairs
print("Cointegrated pairs:")
print(pairs)
Cointegrated pairs:
[('ADBE', 'CRM'), ('AMD', 'MA'), ('AMD', 'MSFT'), ('AMD', 'V'), ('MA', 'MSFT'), ('MA', 'NVDA'), ('MA', 'ORCL'), ('MA', 'V'), ('MSFT', 'V')]
# Function to perform MLR and Dickey-Fuller test on cointegrated pairs
def basket_testing(cointegrated_pairs, data):
results = []
for pair in cointegrated_pairs:
stock1, stock2 = pair
y = data[stock1]
X = data[stock2]
X = sm.add_constant(X) # Adding a constant term to the model
# Perform MLR
model = sm.OLS(y, X).fit()
residuals = model.resid
# Perform Dickey-Fuller test on the residuals
adf_result = adfuller(residuals)
# Store relevant results
results.append({
'Pair': pair,
'MLR R-squared': model.rsquared,
'ADF Statistic': adf_result[0],
'ADF p-value': adf_result[1]
})
return results
# Example usage:
results = basket_testing(pairs, closing_prices)
# Print the results for each pair
for result in results:
print(f"Pair: {result['Pair']}")
print(f"MLR R-squared: {result['MLR R-squared']:.4f}")
print(f"ADF Statistic: {result['ADF Statistic']:.4f}")
print(f"ADF p-value: {result['ADF p-value']:.4f}")
print("\n" + "="*50 + "\n")
Pair: ('ADBE', 'CRM')
MLR R-squared: 0.7713
ADF Statistic: -3.3803
ADF p-value: 0.0116
==================================================
Pair: ('AMD', 'MA')
MLR R-squared: 0.7973
ADF Statistic: -3.8978
ADF p-value: 0.0021
==================================================
Pair: ('AMD', 'MSFT')
MLR R-squared: 0.9189
ADF Statistic: -4.0061
ADF p-value: 0.0014
==================================================
Pair: ('AMD', 'V')
MLR R-squared: 0.7817
ADF Statistic: -3.8457
ADF p-value: 0.0025
==================================================
Pair: ('MA', 'MSFT')
MLR R-squared: 0.8628
ADF Statistic: -4.4840
ADF p-value: 0.0002
==================================================
Pair: ('MA', 'NVDA')
MLR R-squared: 0.6493
ADF Statistic: -3.5706
ADF p-value: 0.0063
==================================================
Pair: ('MA', 'ORCL')
MLR R-squared: 0.7647
ADF Statistic: -3.7994
ADF p-value: 0.0029
==================================================
Pair: ('MA', 'V')
MLR R-squared: 0.9740
ADF Statistic: -4.6446
ADF p-value: 0.0001
==================================================
Pair: ('MSFT', 'V')
MLR R-squared: 0.8567
ADF Statistic: -3.4441
ADF p-value: 0.0095
==================================================
The plots for the actual price of the target stocks against the predicted prices from the MLR model for each cointegrated pair are shown below.
# Function to perform MLR, plot actual vs predicted prices
def plot_mlr_vs_actual_in_grids(cointegrated_pairs, data):
num_pairs = len(cointegrated_pairs)
num_rows = (num_pairs // 3) + (1 if num_pairs % 3 != 0 else 0)
fig, axes = plt.subplots(num_rows, 3, figsize=(18, num_rows * 6))
axes = axes.flatten()
for i, pair in enumerate(cointegrated_pairs):
stock1, stock2 = pair
y = data[stock1]
X = data[stock2]
X = sm.add_constant(X) # Adding a constant term to the model
# Perform MLR
model = sm.OLS(y, X).fit()
mlr_prediction = model.predict(X)
mlr_prediction.name = f"{stock1}_predicted_by_{stock2}"
# Plot actual vs predicted prices
pd.concat([y, mlr_prediction], axis=1).plot(ax=axes[i])
axes[i].set_title(f"{stock1} vs {mlr_prediction.name}")
axes[i].set_ylabel('Price Levels')
axes[i].legend([stock1, mlr_prediction.name])
# Hide any unused subplots
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
plot_mlr_vs_actual_in_grids(pairs, closing_prices)
Portfolio Asset Allocation¶
For conducting portfolio analysis on the 15 filtered stocks, historical data of various asset classes such as indices and bonds have been added. The adjusted Close prices for stocks, bonds have been plotted below
- OLS regression analysis is conducted between the excess return of the stock basket over the risk-free rate (dependent variable) and the market return (independent variable represented by SPY ETF).
- A very high R-square of 91.7% suggests a strong relationship between the portfolio and the market
- A very high F-statistic of 13780 indicates that the overall regression model is statistically significant
- SPY cooeffiecint of 1.2009 is the beta of the portfolio which indicates that the portfolio is 20% more volatile than the market
stock_data = yf.download(['AAPL', 'MSFT', 'NVDA', 'AVGO', 'ORCL', 'CRM', 'ADBE', 'AMD', 'ACN', 'CSCO', 'BRK-B', 'JPM', 'V', 'MA', 'BAC'], start="2019-01-01")
index_data = yf.download(['SPY', '^GSPC','^DJI','^IXIC','^RUT', 'BIL'], start="2019-01-01")
bond_data = yf.download(['^IRX', '^FVX', '^TNX', 'SHY', 'LQD', 'BND', 'GOVT'], start="2019-01-01")
# etf_data = yf.download(['IVV','QQQ','XLV','VEA','EEM','FEZ','GLD','VNQ','ARKK'], start="2019-01-01")
[*********************100%%**********************] 15 of 15 completed [*********************100%%**********************] 6 of 6 completed [*********************100%%**********************] 7 of 7 completed
# Process stock data
prices_yf_adj_close = stock_data['Adj Close']
prices_yf_adj_close = prices_yf_adj_close.dropna()
# Process index data
prices_indx_close = index_data['Close']
prices_indx_close = prices_indx_close.dropna()
# Process bond data
USbond_yf_adj_close = bond_data['Adj Close']
USbond_yf_adj_close = USbond_yf_adj_close.dropna()
# Plot stock prices
plt.figure(figsize=(14, 7))
for c in prices_yf_adj_close.columns.values:
plt.plot(prices_yf_adj_close.index, prices_yf_adj_close[c], lw=3, alpha=0.8, label=c)
plt.legend(loc='upper left', fontsize=12)
plt.ylabel('Price in $')
plt.title('Adjusted Close Prices of Stocks')
plt.show()
# Plot bond prices
plt.figure(figsize=(14, 7))
for c in USbond_yf_adj_close.columns.values:
plt.plot(USbond_yf_adj_close.index, USbond_yf_adj_close[c], lw=3, alpha=0.8, label=c)
plt.legend(loc='upper left', fontsize=12)
plt.ylabel('Price in $')
plt.title('Adjusted Close Prices of Bonds')
plt.show()
# Define training and test date ranges
tr_start_date = '2019-01-01'
tr_end_date = '2023-12-31'
te_start_date = '2024-01-01'
te_end_date = '2024-08-01'
# Split the data into training and test sets
prices_yf_adj_close_train = prices_yf_adj_close.loc[tr_start_date:tr_end_date]
prices_yf_adj_close_test = prices_yf_adj_close.loc[te_start_date:te_end_date]
USbond_yf_adj_close_train = USbond_yf_adj_close.loc[tr_start_date:tr_end_date]
USbond_yf_adj_close_test = USbond_yf_adj_close.loc[te_start_date:te_end_date]
prices_indx_close_train = prices_indx_close.loc[tr_start_date:tr_end_date]
prices_indx_close_test = prices_indx_close.loc[te_start_date:te_end_date]
# Calculate returns
Returns = prices_yf_adj_close_train.pct_change().dropna()
# Calculate equally weighted returns
weights = np.ones(Returns.shape[1]) * 1 / Returns.shape[1]
df_wgts = pd.DataFrame(weights, index=Returns.columns, columns=['Weights'])
# Determine portfolio return based on equal weighted allocation
Basket_Ret = Returns.dot(df_wgts)
# Put results back into Return matrix
Returns['Basket_Ret'] = Basket_Ret
# Portfolio return for training period
Basket_R_train = Returns['Basket_Ret']
# Risk-free proxy
R_F = prices_indx_close_train['BIL'].pct_change()[1:]
# Market returns
M = prices_indx_close_train['SPY'].pct_change()[1:]
# Adjust the returns by subtracting the risk-free rate
Y = Basket_R_train - R_F.loc[Basket_R_train.index]
# Add a constant (intercept) to the explanatory variables
X = sm.add_constant(M.loc[Basket_R_train.index])
# Multiple Linear Regression
Basket_train_results = regression.linear_model.OLS(Y, X).fit()
# Beta of the basket
Basket_beta_train = Basket_train_results.params[1]
# Plot returns
plt.figure(figsize=(14, 7))
M.plot(label='Market (SPY)')
Basket_R_train.plot(label='Stock Basket Return')
R_F.plot(label='Risk-Free (BIL)')
plt.xlabel('Time')
plt.ylabel('Daily Percent Return')
plt.legend()
plt.show()
# Print regression summary
print(Basket_train_results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.917
Model: OLS Adj. R-squared: 0.916
Method: Least Squares F-statistic: 1.378e+04
Date: Thu, 08 Aug 2024 Prob (F-statistic): 0.00
Time: 17:58:08 Log-Likelihood: 4925.2
No. Observations: 1257 AIC: -9846.
Df Residuals: 1255 BIC: -9836.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0004 0.000 2.964 0.003 0.000 0.001
SPY 1.2009 0.010 117.372 0.000 1.181 1.221
==============================================================================
Omnibus: 172.522 Durbin-Watson: 1.904
Prob(Omnibus): 0.000 Jarque-Bera (JB): 676.467
Skew: 0.611 Prob(JB): 1.28e-147
Kurtosis: 6.380 Cond. No. 75.4
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The predicted returns and the predicted cumulative returns closely trail the actual return and the actual cumulative returns of stock Basket which is a good indicatior for our portfolio.
# Prediction for the test period
predictions = R_F_test + Basket_beta_train * (M_test - R_F_test) # CAPM equation
# Plot the predictions against actual returns
plt.figure(figsize=(14, 7))
predictions.plot(label='Prediction')
Basket_R_test.plot(label='Actual Return', color='y')
plt.legend(['Prediction', 'Actual Return'])
plt.xlabel('Time')
plt.ylabel('Daily Percent Return')
plt.title('Predicted vs Actual Returns of Stock Basket')
plt.show()
# Calculate cumulative returns
predictions_cum = predictions.cumsum() + 1
Basket_cum = Basket_R_test.cumsum() + 1
# Plot the cumulative returns
plt.figure(figsize=(14, 7))
predictions_cum.plot(label='Prediction_cum')
Basket_cum.plot(label='Actual Return_cum', color='b')
plt.legend(['Prediction_cum', 'Actual Return_cum'])
plt.xlabel('Time')
plt.ylabel('Cumulative Return')
plt.title('Predicted vs Actual Cumulative Returns of Stock Basket')
plt.show()
- The Maximum Sharpe Ratio Portfolio below has a sharpe ratio of 0.76, annualized return of 29% and annualized volatity of 22%, which gives us the highest return with a significant allocation to both high-growth stocks and some bond ETFs to balance risk.
- The Minimum Volatility Portfolio has an annualized return of 16% and annualized volatity of 16% and focuses on minimizing risk by allocating more towards bonds and stable stocks.
def portfolio_annualised_performance(weights, mean_returns, cov_matrix):
returns = np.sum(mean_returns*weights) * 252 # Number of trading days in the year
std = np.sqrt(np.dot(weights.T, np.dot(cov_matrix, weights))) * np.sqrt(252)
return std, returns
def random_portfolios(num_portfolios, returns, cov_matrix, risk_free_rate):
results = np.zeros((3,num_portfolios))
weights_record = []
mean_returns = returns.mean()
for i in range(num_portfolios):
weights = np.random.random(returns.shape[1])
weights /= np.sum(weights)
weights_record.append(weights)
portfolio_std_dev, portfolio_return = portfolio_annualised_performance(weights, mean_returns, cov_matrix)
results[0,i] = portfolio_std_dev
results[1,i] = portfolio_return
results[2,i] = (portfolio_return - risk_free_rate) / portfolio_std_dev # Sharpe Ratio
return results, weights_record
fyprices_yf_adj_close = prices_yf_adj_close.loc['2019-01-01':tr_end_date]
fyUSbond_yf_adj_close = USbond_yf_adj_close.loc['2019-01-01':tr_end_date]
# Equity Returns
EQreturns = fyprices_yf_adj_close.pct_change()
EQreturns = EQreturns.dropna()
mean_EQreturns = EQreturns.mean()
EQcov_matrix = EQreturns.cov()
# Money Market Returns
MMreturns = fyUSbond_yf_adj_close.pct_change()
MMreturns = MMreturns.dropna()
mean_MMreturns = MMreturns.mean()
MMcov_matrix = MMreturns.cov()
# Concatenating returns
# Place the DataFrames side by side
PortfolioRet = pd.concat([EQreturns, MMreturns], axis=1)
mean_PortfolioRet = PortfolioRet.mean()
cov_matrix = PortfolioRet.cov()
num_portfolios = 25000
risk_free_rate = 0.01
def display_simulated_ef_with_random(returns, cov_matrix, num_portfolios, risk_free_rate):
results, weights = random_portfolios(num_portfolios,returns, cov_matrix, risk_free_rate)
max_sharpe_idx = np.argmax(results[2])
print(results[2][max_sharpe_idx])
sdp, rp = results[0,max_sharpe_idx], results[1,max_sharpe_idx]
max_sharpe_allocation = pd.DataFrame(weights[max_sharpe_idx],index=returns.columns,columns=['allocation'])
max_sharpe_allocation.allocation = [round(i*100,2)for i in max_sharpe_allocation.allocation]
max_sharpe_allocation = max_sharpe_allocation.T
min_vol_idx = np.argmin(results[0])
sdp_min, rp_min = results[0,min_vol_idx], results[1,min_vol_idx]
min_vol_allocation = pd.DataFrame(weights[min_vol_idx],index=returns.columns,columns=['allocation'])
min_vol_allocation.allocation = [round(i*100,2)for i in min_vol_allocation.allocation]
min_vol_allocation = min_vol_allocation.T
print ("-"*80)
print ("Maximum Sharpe Ratio Portfolio Allocation\n")
print ("Annualised Return:", round(rp,2))
print ("Annualised Volatility:", round(sdp,2))
print ("\n")
print (max_sharpe_allocation)
print ("-"*80)
print ("Minimum Volatility Portfolio Allocation\n")
print ("Annualised Return:", round(rp_min,2))
print ("Annualised Volatility:", round(sdp_min,2))
print ("\n")
print (min_vol_allocation)
plt.figure(figsize=(15, 10))
plt.scatter(results[0,:],results[1,:],c=results[2,:],cmap='YlGnBu', marker='o', s=10, alpha=0.3)
plt.colorbar()
plt.scatter(sdp,rp,marker='*',color='r',s=500, label='Maximum Sharpe ratio')
plt.scatter(sdp_min,rp_min,marker='*',color='g',s=500, label='Minimum volatility')
plt.title('Simulated Portfolio Optimization based on Efficient Frontier')
plt.xlabel('annualised volatility')
plt.ylabel('annualised returns')
plt.legend(labelspacing=0.8)
return results, rp , sdp, rp_min, sdp_min
results, rp , sdp, rp_min, sdp_min = display_simulated_ef_with_random(PortfolioRet, cov_matrix, num_portfolios, risk_free_rate)
1.2015685257156334 -------------------------------------------------------------------------------- Maximum Sharpe Ratio Portfolio Allocation Annualised Return: 0.28 Annualised Volatility: 0.22 Ticker AAPL ACN ADBE AMD AVGO BAC BRK-B CRM CSCO JPM ... \ allocation 7.01 9.48 1.5 7.16 6.85 6.05 2.71 0.77 0.67 7.68 ... Ticker NVDA ORCL V BND GOVT LQD SHY ^FVX ^IRX ^TNX allocation 9.25 2.31 0.16 5.6 6.19 1.06 8.66 7.81 0.48 1.57 [1 rows x 22 columns] -------------------------------------------------------------------------------- Minimum Volatility Portfolio Allocation Annualised Return: 0.18 Annualised Volatility: 0.16 Ticker AAPL ACN ADBE AMD AVGO BAC BRK-B CRM CSCO JPM ... \ allocation 6.02 3.23 0.61 0.16 4.36 1.27 0.31 2.91 3.92 9.01 ... Ticker NVDA ORCL V BND GOVT LQD SHY ^FVX ^IRX ^TNX allocation 4.99 2.53 2.34 9.84 10.6 11.45 8.46 1.02 0.57 2.58 [1 rows x 22 columns]
# exclude 13 Wk treasury Bills , as very volatile
exclude2 = ['^IRX']
#exclude2 = []
fyUSbond_yf_adj_close = fyUSbond_yf_adj_close.loc[:, fyUSbond_yf_adj_close.columns.difference(exclude2)]
# Redo Equity Returns
EQreturns = fyprices_yf_adj_close.pct_change()
mean_EQreturns = EQreturns.mean()
EQcov_matrix = EQreturns.cov()
# Money Market Returns
MMreturns = fyUSbond_yf_adj_close.pct_change()
mean_MMreturns = MMreturns.mean()
MMcov_matrix = MMreturns.cov()
# Concatenating returns
# Place the DataFrames side by side
PortfolioRet = pd.concat([EQreturns, MMreturns], axis=1)
mean_PortfolioRet = PortfolioRet.mean()
cov_matrix = PortfolioRet.cov()
results, rp , sdp, rp_min, sdp_min = display_simulated_ef_with_random(PortfolioRet, cov_matrix, num_portfolios, risk_free_rate)
1.2285121027596466 -------------------------------------------------------------------------------- Maximum Sharpe Ratio Portfolio Allocation Annualised Return: 0.29 Annualised Volatility: 0.22 Ticker AAPL ACN ADBE AMD AVGO BAC BRK-B CRM CSCO JPM ... \ allocation 9.4 4.46 3.86 9.7 6.59 0.45 1.47 2.11 0.43 5.99 ... Ticker MSFT NVDA ORCL V BND GOVT LQD SHY ^FVX ^TNX allocation 1.09 10.37 3.25 1.24 7.61 3.73 5.7 7.73 5.54 3.83 [1 rows x 21 columns] -------------------------------------------------------------------------------- Minimum Volatility Portfolio Allocation Annualised Return: 0.16 Annualised Volatility: 0.16 Ticker AAPL ACN ADBE AMD AVGO BAC BRK-B CRM CSCO JPM ... \ allocation 6.21 3.5 0.46 1.08 0.01 1.48 9.18 4.19 7.11 4.91 ... Ticker MSFT NVDA ORCL V BND GOVT LQD SHY ^FVX ^TNX allocation 7.48 0.87 8.27 2.79 8.13 7.67 7.26 9.32 1.45 4.53 [1 rows x 21 columns]
Efficient Frontier¶
The Efficient Frontier below helps identify the best possible portfolios that can be constructed with our basket. We simulate multiple random portfolios and calculate their risk and return characteristics. The code then visualizes the Efficient Frontier along with the simulated portfolios, highlighting the optimal portfolios
- It's observed that while the annualized return has dropped from 29% to 11%, the sharpe ratio has also reduced from 0.76 to 0.64 and the annualized volatity has drastically reduced from 22% to 7%
- however 34% allaocation has been given to BOND which I may tweak lower to increase the annualized returns as part of my investment strategy.
def neg_sharpe_ratio(weights, mean_returns, cov_matrix, risk_free_rate):
p_var, p_ret = portfolio_annualised_performance(weights, mean_returns, cov_matrix)
return -(p_ret - risk_free_rate) / p_var
def max_sharpe_ratio(mean_returns, cov_matrix, risk_free_rate):
num_assets = len(mean_returns)
args = (mean_returns, cov_matrix, risk_free_rate)
constraints = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1})
bound = (0.0,1.0)
bounds = tuple(bound for asset in range(num_assets))
result = sco.minimize(neg_sharpe_ratio, num_assets*[1./num_assets,], args=args,
method='SLSQP', bounds=bounds, constraints=constraints)
return result
def portfolio_volatility(weights, mean_returns, cov_matrix):
return portfolio_annualised_performance(weights, mean_returns, cov_matrix)[0]
def min_variance(mean_returns, cov_matrix):
num_assets = len(mean_returns)
args = (mean_returns, cov_matrix)
constraints = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1})
bound = (0.0,1.0)
bounds = tuple(bound for asset in range(num_assets))
result = sco.minimize(portfolio_volatility, num_assets*[1./num_assets,], args=args,
method='SLSQP', bounds=bounds, constraints=constraints)
return result
def efficient_return(mean_returns, cov_matrix, target):
num_assets = len(mean_returns)
args = (mean_returns, cov_matrix)
def portfolio_return(weights):
return portfolio_annualised_performance(weights, mean_returns, cov_matrix)[1]
constraints = ({'type': 'eq', 'fun': lambda x: portfolio_return(x) - target},
{'type': 'eq', 'fun': lambda x: np.sum(x) - 1})
bounds = tuple((0,1) for asset in range(num_assets))
result = sco.minimize(portfolio_volatility, num_assets*[1./num_assets,], args=args, method='SLSQP', bounds=bounds, constraints=constraints)
#print(result)
return result
def efficient_frontier(mean_returns, cov_matrix, returns_range):
efficients = []
for ret in returns_range:
efficients.append(efficient_return(mean_returns, cov_matrix, ret))
return efficients
def display_calculated_ef_with_random(returns, cov_matrix, num_portfolios, risk_free_rate):
results, _ = random_portfolios(num_portfolios,returns, cov_matrix, risk_free_rate)
mean_returns = returns.mean()
max_sharpe = max_sharpe_ratio(mean_returns, cov_matrix, risk_free_rate)
sdp, rp = portfolio_annualised_performance(max_sharpe['x'], mean_returns, cov_matrix)
max_sharpe_allocation = pd.DataFrame(max_sharpe.x,index=returns.columns,columns=['allocation'])
max_sharpe_allocation.allocation = [round(i*100,2)for i in max_sharpe_allocation.allocation]
max_sharpe_allocation = max_sharpe_allocation.T
max_sharpe_allocation
min_vol = min_variance(mean_returns, cov_matrix)
sdp_min, rp_min = portfolio_annualised_performance(min_vol['x'], mean_returns, cov_matrix)
min_vol_allocation = pd.DataFrame(min_vol.x,index=returns.columns,columns=['allocation'])
min_vol_allocation.allocation = [round(i*100,2)for i in min_vol_allocation.allocation]
min_vol_allocation = min_vol_allocation.T
print ("-"*80)
print ("Maximum Sharpe Ratio Portfolio Allocation\n")
print ("Annualised Return:", round(rp,2))
print ("Annualised Volatility:", round(sdp,2))
print ("\n")
print (max_sharpe_allocation)
print ("-"*80)
print ("Minimum Volatility Portfolio Allocation\n")
print ("Annualised Return:", round(rp_min,2))
print ("Annualised Volatility:", round(sdp_min,2))
print ("\n")
print (min_vol_allocation)
plt.figure(figsize=(15, 10))
plt.scatter(results[0,:],results[1,:],c=results[2,:],cmap='YlGnBu', marker='o', s=20, alpha=0.3)
plt.colorbar()
plt.scatter(sdp,rp,marker='*',color='r',s=500, label='Maximum Sharpe ratio')
plt.scatter(sdp_min,rp_min,marker='*',color='g',s=500, label='Minimum volatility')
target = np.linspace(rp_min, 0.32, 50)
efficient_portfolios = efficient_frontier(mean_returns, cov_matrix, target)
plt.plot([p['fun'] for p in efficient_portfolios], target, linestyle='-.', color='black', label='efficient frontier')
#plt.title('Calculated Portfolio Optimization based on Efficient Frontier')
plt.title('Portfolio Optimization based on Efficient Frontier')
plt.xlabel('annualised volatility')
plt.ylabel('annualised returns')
plt.legend(labelspacing=0.8)
display_calculated_ef_with_random(PortfolioRet, cov_matrix, num_portfolios, risk_free_rate)
-------------------------------------------------------------------------------- Maximum Sharpe Ratio Portfolio Allocation Annualised Return: 0.11 Annualised Volatility: 0.07 Ticker AAPL ACN ADBE AMD AVGO BAC BRK-B CRM CSCO JPM ... \ allocation 6.35 0.0 0.0 0.11 1.85 0.0 0.0 0.0 0.0 0.0 ... Ticker MSFT NVDA ORCL V BND GOVT LQD SHY ^FVX ^TNX allocation 0.0 8.0 0.0 0.0 0.0 34.22 0.0 44.96 4.5 0.0 [1 rows x 21 columns] -------------------------------------------------------------------------------- Minimum Volatility Portfolio Allocation Annualised Return: 0.02 Annualised Volatility: 0.02 Ticker AAPL ACN ADBE AMD AVGO BAC BRK-B CRM CSCO JPM ... MSFT \ allocation 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 Ticker NVDA ORCL V BND GOVT LQD SHY ^FVX ^TNX allocation 0.0 0.0 0.0 0.0 0.0 0.0 98.58 0.83 0.6 [1 rows x 21 columns]
def display_ef_with_selected(returns, cov_matrix, risk_free_rate):
mean_returns = returns.mean()
max_sharpe = max_sharpe_ratio(mean_returns, cov_matrix, risk_free_rate)
sdp, rp = portfolio_annualised_performance(max_sharpe['x'], mean_returns, cov_matrix)
max_sharpe_allocation = pd.DataFrame(max_sharpe.x,index=returns.columns,columns=['allocation'])
max_sharpe_allocation.allocation = [round(i*100,2)for i in max_sharpe_allocation.allocation]
max_sharpe_allocation = max_sharpe_allocation.T
max_sharpe_allocation
min_vol = min_variance(mean_returns, cov_matrix)
sdp_min, rp_min = portfolio_annualised_performance(min_vol['x'], mean_returns, cov_matrix)
min_vol_allocation = pd.DataFrame(min_vol.x,index=returns.columns,columns=['allocation'])
min_vol_allocation.allocation = [round(i*100,2)for i in min_vol_allocation.allocation]
min_vol_allocation = min_vol_allocation.T
an_vol = np.std(returns) * np.sqrt(252)
an_rt = mean_returns * 252
print ("-"*80)
print ("Maximum Sharpe Ratio Portfolio Allocation\n")
print ("Annualised Return:", round(rp,2))
print ("Annualised Volatility:", round(sdp,2))
print ("\n")
print (max_sharpe_allocation)
print ("-"*80)
print ("Minimum Volatility Portfolio Allocation\n")
print ("Annualised Return:", round(rp_min,2))
print ("Annualised Volatility:", round(sdp_min,2))
print ("\n")
print (min_vol_allocation)
print ("-"*80)
print ("Individual Stock Returns and Volatility\n")
for i, txt in enumerate(returns.columns):
print (txt,":","annualised return",round(an_rt[i],2),", annualised volatility:",round(an_vol[i],2))
print ("-"*80)
fig, ax = plt.subplots(figsize=(22, 15))
ax.scatter(an_vol,an_rt,marker='o',s=700)
for i, txt in enumerate(returns.columns):
ax.annotate(txt, (an_vol[i],an_rt[i]), xytext=(18,0), textcoords='offset points')
ax.scatter(sdp,rp,marker='*',color='r',s=500, label='Maximum Sharpe ratio')
ax.scatter(sdp_min,rp_min,marker='*',color='g',s=500, label='Minimum volatility')
target = np.linspace(rp_min, 0.45, 70)
efficient_portfolios = efficient_frontier(mean_returns, cov_matrix, target)
ax.plot([p['fun'] for p in efficient_portfolios], target, linestyle='-.', color='black', label='efficient frontier')
ax.set_title('Portfolio Optimization with Individual Stocks')
ax.set_xlabel('annualised volatility')
ax.set_ylabel('annualised returns')
ax.legend(labelspacing=0.8)
display_ef_with_selected(PortfolioRet, cov_matrix, risk_free_rate)
-------------------------------------------------------------------------------- Maximum Sharpe Ratio Portfolio Allocation Annualised Return: 0.11 Annualised Volatility: 0.07 Ticker AAPL ACN ADBE AMD AVGO BAC BRK-B CRM CSCO JPM ... \ allocation 6.35 0.0 0.0 0.11 1.85 0.0 0.0 0.0 0.0 0.0 ... Ticker MSFT NVDA ORCL V BND GOVT LQD SHY ^FVX ^TNX allocation 0.0 8.0 0.0 0.0 0.0 34.22 0.0 44.96 4.5 0.0 [1 rows x 21 columns] -------------------------------------------------------------------------------- Minimum Volatility Portfolio Allocation Annualised Return: 0.02 Annualised Volatility: 0.02 Ticker AAPL ACN ADBE AMD AVGO BAC BRK-B CRM CSCO JPM ... MSFT \ allocation 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 Ticker NVDA ORCL V BND GOVT LQD SHY ^FVX ^TNX allocation 0.0 0.0 0.0 0.0 0.0 0.0 98.58 0.83 0.6 [1 rows x 21 columns] -------------------------------------------------------------------------------- Individual Stock Returns and Volatility AAPL : annualised return 0.38 , annualised volatility: 0.32 ACN : annualised return 0.24 , annualised volatility: 0.28 ADBE : annualised return 0.26 , annualised volatility: 0.37 AMD : annualised return 0.56 , annualised volatility: 0.54 AVGO : annualised return 0.4 , annualised volatility: 0.37 BAC : annualised return 0.15 , annualised volatility: 0.36 BRK-B : annualised return 0.14 , annualised volatility: 0.22 CRM : annualised return 0.21 , annualised volatility: 0.38 CSCO : annualised return 0.1 , annualised volatility: 0.28 JPM : annualised return 0.19 , annualised volatility: 0.32 MA : annualised return 0.22 , annualised volatility: 0.32 MSFT : annualised return 0.32 , annualised volatility: 0.3 NVDA : annualised return 0.67 , annualised volatility: 0.52 ORCL : annualised return 0.23 , annualised volatility: 0.31 V : annualised return 0.18 , annualised volatility: 0.28 BND : annualised return 0.01 , annualised volatility: 0.07 GOVT : annualised return 0.01 , annualised volatility: 0.06 LQD : annualised return 0.03 , 
annualised volatility: 0.11 SHY : annualised return 0.01 , annualised volatility: 0.02 ^FVX : annualised return 0.38 , annualised volatility: 0.78 ^TNX : annualised return 0.27 , annualised volatility: 0.64 --------------------------------------------------------------------------------
Diversification Factor¶
We achive a Portfolio Diversification Factor (PDF) of 8.53 indicating there's a high level of diversification in the portfolio. The portfolioās volatility is 11.7% of what it would be if the portfolio was not diversified at all. Thecombination of assets in our portfolio significantly reduces overall risk which is a positive indicator.
def fetch_data(tickers, start_date, end_date=None):
data = yf.download(tickers, start=start_date, end=end_date)['Adj Close']
return data
def calculate_diversification_factor(weights, correlation_matrix):
# Ensure weights and correlation matrix are numpy arrays
weights = np.array(weights)
correlation_matrix = np.array(correlation_matrix)
# Calculate portfolio variance
portfolio_variance = np.dot(weights.T, np.dot(correlation_matrix, weights))
# Calculate the weighted variance (assuming all variances are 1 for simplicity)
weighted_variance = np.sum(weights ** 2)
# Diversification factor
diversification_factor = portfolio_variance / weighted_variance
return diversification_factor
# filtered stocks selected for analysis
tickers = ['AAPL', 'MSFT', 'NVDA', 'AVGO', 'ORCL', 'CRM', 'ADBE', 'AMD', 'ACN', 'CSCO', 'BRK-B', 'JPM', 'V', 'MA', 'BAC']
start_date = '2019-01-01'
weights = [1/len(tickers) for i in tickers]
# Fetch historical data
data = fetch_data(tickers, start_date)
# Calculate diversification factor
pdf = calculate_diversification_factor(weights, correlation_matrix)
print("Portfolio Diversification Factor:", pdf)
[*********************100%%**********************] 15 of 15 completed
Portfolio Diversification Factor: 8.526933322445922
Value at Risk; Conditional Value at Risk¶
Our constructed portfolio portfolio is at risk of losing up to 1.66% (Portfolio VaR Return) or USD 16,648.59 on 95% of trading days. However, in the worst 5% of cases, if losses do occur, the average loss could be 2.18% (Portfolio CVaR Return) or USD 21,823.51. We can conclude that our portfolio risk is minimal.
def value_at_risk(value_invested, returns, weights, alpha=0.95, lookback_days=780):
returns = returns.fillna(0.0)
portfolio_returns = returns.iloc[-lookback_days:].dot(weights)
var = np.percentile(portfolio_returns, (1 - alpha) * 100)
return value_invested * var
def cvar(value_invested, returns, weights, alpha=0.95, lookback_days=780):
var = value_at_risk(value_invested, returns, weights, alpha, lookback_days=lookback_days)
returns = returns.fillna(0.0)
portfolio_returns = returns.iloc[-lookback_days:].dot(weights)
var_pct_loss = var / value_invested
cvar_value = value_invested * np.nanmean(portfolio_returns[portfolio_returns < var_pct_loss])
return cvar_value
tickers = ['AAPL', 'MSFT', 'NVDA', 'AVGO', 'ORCL', 'CRM', 'ADBE', 'AMD', 'ACN', 'CSCO', 'BRK-B', 'JPM', 'V', 'MA', 'BAC']
start_date = '2019-01-01'
value_invested = 1000000
# Fetch historical data
data = fetch_data(tickers, start_date)
Returns = data.pct_change().dropna()
# Initialize and normalize weights
weights = np.ones((len(tickers), 1))
weights = weights / np.sum(weights)
# Split the data into training and test sets
Ret_tr = Returns.loc[tr_start_date:tr_end_date]
Ret_te = Returns.loc[te_start_date:te_end_date]
lookback_days = 780
alpha = 0.95
# Calculate portfolio returns for test set
portfolio_returns_te = Ret_te.fillna(0.0).iloc[-lookback_days:].dot(weights)
# Calculate VaR and CVaR
portfolio_VaR_te = value_at_risk(value_invested, Ret_te, weights, alpha=alpha)
portfolio_VaR_return_te = portfolio_VaR_te / value_invested
portfolio_CVaR_te = cvar(value_invested, Ret_te, weights, alpha=alpha)
portfolio_CVaR_return_te = portfolio_CVaR_te / value_invested
[*********************100%%**********************] 15 of 15 completed
print(f"Portfolio VaR Return (95%): {portfolio_VaR_return_te * 100:.2f}%")
print(f"Portfolio VaR Amount (95%): ${portfolio_VaR_te:.2f}")
print(f"Portfolio CVaR Return (95%): {portfolio_CVaR_return_te * 100:.2f}%")
print(f"Portfolio CVaR Amount (95%): ${portfolio_CVaR_te:.2f}")
plt.hist(portfolio_returns_te[portfolio_returns_te > portfolio_VaR_return_te] * 100, bins=30, color='green')
plt.hist(portfolio_returns_te[portfolio_returns_te < portfolio_VaR_return_te] * 100, bins=15, color='orange')
plt.axvline(portfolio_VaR_return_te * 100, color='blue', linestyle='solid')
plt.axvline(portfolio_CVaR_return_te * 100, color='red', linestyle='dashed')
plt.legend(['VaR for Specified Alpha as a Return',
'CVaR for Specified Alpha as a Return',
'Historical Returns Distribution',
'Returns < VaR'])
plt.title('Historical 95% VaR and CVaR')
plt.xlabel('Return in %')
plt.ylabel('Observation Frequency')
plt.show()
Portfolio VaR Return (95%): -1.66% Portfolio VaR Amount (95%): $-16648.59 Portfolio CVaR Return (95%): -2.18% Portfolio CVaR Amount (95%): $-21823.51
Volatility, Hedged Portfolio PnL¶
For risk management and further optimizing our portfolio, we remove MPC from the portfolio as its the only filtered stock from the Energy sector in the basket. Then we calculate residuals and betas of stocks within the two remaining sectors (Technology and Finance) relative to the overall market and their respective sector benchmarks.
- The covariance matrix of the residuals is calculated using the Ledoit-Wolf shrinkage
- The average pairwise correlation of our portfolio is only 0.0118 indicating that the stocks within our portfolio have very low correlation and are diversified
- The Beta of the technology stocks basket is 1.3582, beta of the finance stocks basket is 1.1040; and the overall portfolio beta is 1.2492
start_date = '2019-01-01'
SectorHedgeBasket = ['ADBE', 'GS', 'MSFT', 'MA', 'ACN', 'CRM', 'BRK-B', 'AMD', 'AXP', 'AAPL', 'V', 'JPM', 'AVGO', 'NVDA', 'XLK', 'SPY', 'WFC']
prices_SHedgeB = fetch_data(SectorHedgeBasket, start_date)
# Calculate returns
rets = prices_SHedgeB.pct_change().dropna()
[*********************100%%**********************] 17 of 17 completed
# Get market hedge ticker
mkt = ['SPY']
# Get sector hedge tickers
sector_1_hedge = ['XLK'] # Technology
sector_2_hedge = ['WFC'] # Finance
# Identify securities for each sector
sector_1_stocks = ['ADBE', 'MSFT', 'ACN', 'CRM', 'AMD', 'AAPL', 'AVGO', 'NVDA'] # Technology stocks
sector_2_stocks = ['GS', 'MA', 'BRK-B', 'AXP', 'V', 'JPM'] # Finance stocks
# Extract returns
market = rets[mkt]
sector_1_rets = rets[sector_1_hedge]
sector_2_rets = rets[sector_2_hedge]
stock_rets = rets.drop(['XLK', 'SPY', 'WFC'], axis=1)
residuals_market = stock_rets.copy() * 0
residuals = stock_rets.copy() * 0
betas = {}
# Calculate market beta of sector 1 benchmark
model = sm.OLS(sector_1_rets.values, market.values)
results = model.fit()
sector_1_excess = results.resid
# Calculate market beta of sector 2 benchmark
model = sm.OLS(sector_2_rets.values, market.values)
results = model.fit()
sector_2_excess = results.resid
# Calculate residuals and betas for sector 1 stocks
for stock in sector_1_stocks:
model = sm.OLS(stock_rets[stock], market.values)
results = model.fit()
betas[stock] = results.params[0]
residuals_market[stock] = results.resid
model = sm.OLS(residuals_market[stock], sector_1_excess)
results = model.fit()
residuals[stock] = results.resid
# Calculate residuals and betas for sector 2 stocks
for stock in sector_2_stocks:
model = sm.OLS(stock_rets[stock], market.values)
results = model.fit()
betas[stock] = results.params[0]
residuals_market[stock] = results.resid
model = sm.OLS(residuals_market[stock], sector_2_excess)
results = model.fit()
residuals[stock] = results.resid
# Get covariance of residuals
lw_cov = LedoitWolf().fit(residuals).covariance_
# Extract correlation from covariance
def extract_corr_from_cov(cov):
d = np.sqrt(np.diag(cov))
corr = cov / np.outer(d, d)
corr[cov == 0] = 0
return corr
# Create cumulative returns
Cumret2 = 100 * (1 + residuals).cumprod()
# Plot cumulative returns and correlation heatmap
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14, 7))
fig.tight_layout(pad=3.0)
# Plot cumulative returns
Cumret2.plot(ax=ax1)
ax1.set_ylabel("Cumulative Return (%)")
ax1.set_title("Cumulative Returns")
# Plot heatmap of correlations
corr = extract_corr_from_cov(lw_cov)
sns.heatmap(corr, ax=ax2, fmt='d', vmin=-1, vmax=1, xticklabels=stock_rets.columns, yticklabels=stock_rets.columns)
ax2.set_title("Correlation Heatmap")
plt.show()
# Calculate average pairwise correlation
average_corr = np.mean(corr[np.triu_indices_from(corr, k=1)])
print('Average pairwise correlation: %.4f' % average_corr)
Average pairwise correlation: 0.0118
# Calculate and plot PnL for the hedged portfolio
weights = np.ones(len(stock_rets.columns)) / len(stock_rets.columns)
portfolio_returns = (residuals @ weights).cumsum()
initial_investment = 100000
pnl = initial_investment * (1 + portfolio_returns)
plt.figure(figsize=(14, 7))
pnl.plot(title='Hedged Portfolio PnL')
plt.xlabel('Time')
plt.ylabel('PnL ($)')
plt.show()
# Calculate the average beta for each sector
avg_beta_sector_1 = np.mean([betas[stock] for stock in sector_1_stocks])
avg_beta_sector_2 = np.mean([betas[stock] for stock in sector_2_stocks])
print(f'Beta of the Technology Stocks Basket: {avg_beta_sector_1:.4f}')
print(f'Beta of the Finance Stocks Basket: {avg_beta_sector_2:.4f}')
# Calculate the overall portfolio beta
portfolio_beta = sum(weights[i] * betas[stock] for i, stock in enumerate(stock_rets.columns))
print(f'Overall Portfolio Beta: {portfolio_beta:.4f}')
Beta of the Technology Stocks Basket: 1.3582 Beta of the Finance Stocks Basket: 1.1040 Overall Portfolio Beta: 1.2492